[GH-ISSUE #5360] Support for Snapdragon X Elite NPU & GPU #3357

Open
opened 2026-04-12 13:58:21 -05:00 by GiteaMirror · 42 comments

Originally created by @flyfox666 on GitHub (Jun 28, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5360

Originally assigned to: @dhiltgen on GitHub.

Hi all.

I just got a Microsoft Surface Laptop 7, an AI PC with a Snapdragon X Elite, an NPU, and an Adreno GPU. It is an ARM-based system.

But I found that the NPU is not used when running Ollama.

Will Ollama support the NPU and GPU?

GiteaMirror added the feature request and windows labels 2026-04-12 13:58:22 -05:00

@tholum commented on GitHub (Jun 28, 2024):

More than support for the GPU, I think the Hexagon NPU would be better to support.

@flyfox666 commented on GitHub (Jun 29, 2024):

> More than support for the GPU, I think the Hexagon NPU would be better to support.

Yep, the NPU is better.

@leejw51 commented on GitHub (Jun 30, 2024):

On a Samsung Galaxy Book4 with a Snapdragon X Elite, Ollama is too slow.

@Srafington commented on GitHub (Jun 30, 2024):

Those wanting a bit more oomph before this issue is addressed should run Ollama via WSL, as there are native ARM binaries for Linux. They still won't use the NPU or GPU, but it is still much faster than running the Windows x86-64 binaries through emulation. SLMs like Phi are very speedy when run this way.
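For reference, a minimal sketch of the WSL route (assuming an Ubuntu arm64 distro is already set up; the script is Ollama's standard Linux installer, and the model name is just an example):

```shell
# inside the WSL2 Ubuntu shell - installs the native arm64 Linux build
curl -fsSL https://ollama.com/install.sh | sh
# pull and chat with a small model such as Phi-3
ollama run phi3
```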

@dhiltgen commented on GitHub (Jul 3, 2024):

We don't yet have an official arm windows binary, but you should be able to build from source until we do.

@danilofalcao commented on GitHub (Jul 7, 2024):

I would be available to test any developments on that matter if necessary.

@dhiltgen commented on GitHub (Jul 22, 2024):

Once #5712 merges we'll have official support for running in CPU mode on the Snapdragon systems, but additional PR(s) will need to merge upstream in llama.cpp before NPU/GPU support can be enabled. On my test system, I'm seeing ~18-20TPS on llama3 on the CPU, so it's usable. My understanding is the NPU may actually be slightly slower, although much more power efficient.

@AndreasKunar commented on GitHub (Jul 23, 2024):

Please note that recent llama.cpp innovations with Q4_0_4_8 quantization on Snapdragon X CPUs give nearly the same performance as (or more than) Q4_0 on base Apple Silicon with its GPU; see [accelerating Q4_0 CPU performance 2-2.5x](https://github.com/ggerganov/llama.cpp/pull/5780).

I also tried to get llama.cpp GPU acceleration to work on Snapdragon X via Vulkan, but it's not working (yet) - see [llama.cpp issue #8455](https://github.com/ggerganov/llama.cpp/issues/8455).

@AndreasKunar commented on GitHub (Jul 23, 2024):

E.g., here is the performance of a Snapdragon X Plus (CPU-only, but Q4_0_4_8-optimized) vs. a 10-core M2 (CPU and GPU) for the new Llama3-8B Groq-Tool-Use optimized local LLM. Yes, the Plus is still slower than the M2, but not by much, and the Elite is probably faster.

Snapdragon X Plus, Surface 11 Pro 16GB, Windows 11 24H2, MSVC+clang, llama.cpp build: 081fe431 (3441):

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 46.02 ± 0.37 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 12.58 ± 2.63 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 150.04 ± 10.17 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 16.81 ± 3.46 |

M2 10GPU, MacBook Air 24GB, MacOS 14.5, llama.cpp build: 081fe431 (3441):

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 0 | pp512 | 58.54 ± 0.17 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 0 | tg128 | 12.97 ± 0.08 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | pp512 | 178.03 ± 0.12 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | tg128 | 19.20 ± 0.11 |

P.S.: llama.cpp Q4_0_4_8 conversion is done via `./llama-quantize --allow-requantize <q4_0 model-name> <q4_0_4_8 name> Q4_0_4_8`
P.P.S.: token-generation (tg) is largely memory-bandwidth bound, while prompt-processing (pp) is compute-horsepower dependent.
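The tables above are llama-bench output. A minimal sketch to reproduce a similar run (file names are placeholders; llama-quantize and llama-bench are built as part of llama.cpp):

```shell
# requantize an existing Q4_0 GGUF into the CPU-optimized Q4_0_4_8 layout (placeholder file names)
./llama-quantize --allow-requantize llama3-8b-Q4_0.gguf llama3-8b-Q4_0_4_8.gguf Q4_0_4_8
# benchmark prompt processing (pp512) and token generation (tg128) on 10 threads
./llama-bench -m llama3-8b-Q4_0_4_8.gguf -t 10 -p 512 -n 128
```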

@flyfox666 commented on GitHub (Jul 24, 2024):

> Once #5712 merges we'll have official support for running in CPU mode on the Snapdragon systems, but additional PR(s) will need to merge upstream in llama.cpp before NPU/GPU support can be enabled. On my test system, I'm seeing ~18-20TPS on llama3 on the CPU, so it's usable. My understanding is the NPU may actually be slightly slower, although much more power efficient.

Hi, thanks for the reply. Looking forward to it.

@flyfox666 commented on GitHub (Jul 24, 2024):

> E.g., here is the performance of a Snapdragon X Plus (CPU-only, but Q4_0_4_8-optimized) vs. a 10-core M2 (CPU and GPU) for the new Llama3-8B Groq-Tool-Use optimized local LLM. Yes, the Plus is still slower than the M2, but not by much, and the Elite is probably faster.
>
> Snapdragon X Plus, Surface 11 Pro 16GB, Windows 11 24H2, MSVC+clang, llama.cpp build: 081fe431 (3441):
>
> | model | size | params | backend | threads | test | t/s |
> | --- | ---: | ---: | --- | ---: | ---: | ---: |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 46.02 ± 0.37 |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 12.58 ± 2.63 |
> | llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 150.04 ± 10.17 |
> | llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 16.81 ± 3.46 |
>
> M2 10GPU, MacBook Air 24GB, MacOS 14.5, llama.cpp build: 081fe431 (3441):
>
> | model | size | params | backend | ngl | test | t/s |
> | --- | ---: | ---: | --- | --: | ---: | ---: |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 0 | pp512 | 58.54 ± 0.17 |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 0 | tg128 | 12.97 ± 0.08 |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | pp512 | 178.03 ± 0.12 |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | tg128 | 19.20 ± 0.11 |
>
> P.S.: llama.cpp Q4_0_4_8 conversion is done via `./llama-quantize --allow-requantize <q4_0 model-name> <q4_0_4_8 name> Q4_0_4_8`
> P.P.S.: token-generation (tg) is largely memory-bandwidth bound, while prompt-processing (pp) is compute-horsepower dependent.

thanks a lot

@Hassansaleh22 commented on GitHub (Jul 27, 2024):

Thanks all,
Do you have an estimated timeline for when the necessary pull requests (#5712 and others for NPU/GPU support) will be merged? Also, will we need to uninstall the current version before updating to get native ARM working without emulation?

@SebastianGode commented on GitHub (Aug 1, 2024):

@AndreasKunar Importing a Q4_0_4_8 model built under WSL into native ARM Ollama doesn't seem to work.
Ollama doesn't support Q4_0_4_8 yet, correct?

@AndreasKunar commented on GitHub (Aug 1, 2024):

> @AndreasKunar Importing a Q4_0_4_8 model built under WSL into native ARM Ollama doesn't seem to work. Ollama doesn't support Q4_0_4_8 yet, correct?

Q4_0_4_8 requires an arm64 compile of llama.cpp (Linux and Windows). And for Windows it requires a build with clang, since MSVC does not support the required inline asm for arm64. See the [llama.cpp build instructions](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md). I don't know how ollama builds, and whether the llama.cpp component's build process correctly builds for Windows on ARM - I have not tested PR #5712 yet.

Building for Snapdragon X in WSL2 with e.g. Ubuntu is commonly much easier, and it's not slower than in native Windows. Just don't forget to allocate CPUs and memory to WSL2 in `%USERPROFILE%\.wslconfig`:

```shell
[wsl2]
processors=10
memory=12GB
```
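The new limits only take effect once the WSL VM restarts; a minimal way to force that (assuming nothing else is running in WSL):

```shell
# run from Windows (PowerShell or cmd); .wslconfig is re-read on the next WSL start
wsl --shutdown
```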

I will try and build ollama in WSL2 on my Surface and try and import+use a Q4_0_4_8 model.

@AndreasKunar commented on GitHub (Aug 1, 2024):

@SebastianGode - I tried to build ollama on WSL2/Ubuntu24.04 on my Surface 11 Pro and test it with Q4_0_4_8.
Ollama+llama.cpp builds, imports my local llama-2 Q4_0, and runs it.
But when I try and import my local llama-2 Q4_0_4_8 model (which runs with llama.cpp), it gives an "Error: invalid file magic" from its ggml.go module (at line#311), which does not seem to understand the new Q4_0_4_4 and Q4_0_4_8 formats.

Should we raise an issue?

@SebastianGode commented on GitHub (Aug 1, 2024):

@AndreasKunar Yes, that is the exact same issue for me. Good that you could verify that and that I wasn't too dumb to use Ollama.

Please go ahead and open an issue. I assume this shouldn't be that hard to fix, likely just some dependency which would need to be updated (but that's just my assumption).

@Berowne commented on GitHub (Aug 28, 2024):

I'm keen to stand on the shoulders of giants. I've subscribed to this thread! Keep up the good work.

@arudaev commented on GitHub (Sep 9, 2024):

I'm new to using llama.cpp and related tools. After testing my device, I'm satisfied with its performance, but Ollama is running very slowly. My goal is to set up a Docker container that leverages WSL2 to run Llama 3 (7B) efficiently and maximize performance. However, the available resources are overwhelming and unclear. I need a streamlined solution to run an Ollama container with optimal speed and accuracy.

- [Microsoft Surface Pro, 11th 32GB: GPU Geekbench](https://browser.geekbench.com/v6/compute/2729091)
- [Microsoft Surface Pro, 11th 32GB: CPU Geekbench](https://browser.geekbench.com/v6/cpu/7688747)
- [Microsoft Surface Pro, 11th 32GB: NPU Geekbench AI](https://browser.geekbench.com/ai/v1/49797)

@AndreasKunar commented on GitHub (Sep 9, 2024):

> I'm new to using llama.cpp and related tools. After testing my device, I'm satisfied with its performance, but Ollama is running very slowly. My goal is to set up a Docker container that leverages WSL2 to run Llama 3 (7B) efficiently and maximize performance. However, the available resources are overwhelming and unclear. I need a streamlined solution to run an Ollama container with optimal speed and accuracy.

1. There is currently no GPU/NPU support in ollama (or the llama.cpp code it's based on) for the Snapdragon X - so forget about GPU/NPU Geekbench results, they don't matter. The underlying llama.cpp code does not currently work with the Qualcomm Vulkan GPU driver for Windows (in WSL2 the Vulkan driver works, but it is a very slow CPU emulation). There is some work being done in llama.cpp to try to support the QNN code, but it's quite far from being workable.

   The Snapdragon X does, however, support accelerated execution via the CPU, but this is in its very early stages with the core llama.cpp code ollama uses. This CPU acceleration is mainly for prompt-processing (2-3x faster); LLM token-generation is more dependent on memory bandwidth and not improved much. And it currently requires a special model format (quantized as Q4_0_4_8 instead of Q4_0).

2. ollama on Windows (not WSL2) is currently in preview. You need to compile it manually if you want it to run natively on Windows for ARM - I would not recommend this for beginners. The installation you use might run emulated as x64 code.

3. WSL2 needs to be configured accordingly (file `.wslconfig` in your Windows user directory) in order to use the right amount of RAM (setting: memory, default is only 50%) and all CPUs (setting: processors, I suggest 12). You need to make sure that you use an aarch64/arm64 ollama Linux image. But ollama builds and runs really well on WSL2 Linux. Running ollama in a correctly configured WSL2 is as fast as (maybe even faster than) running natively. There are performance penalties if you don't store your files natively in WSL2/Linux. WSL2 currently does NOT support GPU/NPU acceleration for the Snapdragon X, but it does support the CPU acceleration of the llama.cpp code.

4. Docker - I have no experience with running ollama on WSL2-based Docker on Windows for ARM (see the sketch after this list).
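For anyone who wants to try the Docker route anyway, a minimal sketch (untested on Windows on ARM; it assumes Docker Desktop's WSL2 backend pulls the arm64 variant of the official ollama/ollama image):

```shell
# start the ollama container, keeping models in a named volume and exposing the API port
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# pull and run a model through the container
docker exec -it ollama ollama run llama3
```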

A few personal notes on the Surface Pro 11 and ollama/llama.cpp:

- ollama is a great shell for reducing the complexity of the base llama.cpp code and I really like it!!! But the innovations in GPU/NPU acceleration happen first in llama.cpp. I use the llama.cpp `llama-server` instead of ollama when trying out new things (see the sketch after this list). But you need to manually download your model and start llama-server with the right parameters. As a benefit, llama-server offers not only OpenAI-compatible APIs but also a playground-like webserver. When using llama3.1 8B with its long context, don't forget to limit the context size, otherwise your RAM use "explodes" (because of the KV cache required for the default 128k context).
- The thermals of the Surface Pro 11 tablet force the Snapdragon X to throttle quite soon if you max out all the CPUs while running your LLMs. Watch your CPU utilization (there is currently no CPU-temperature monitor for the Surfaces).
- If you want GPU acceleration for your Surface, you might try WebGL-based AI (e.g. in Chrome).
- NPU-accelerated AI on the Surface currently all seems to be Qualcomm QNN based. Microsoft's Semantic Kernel supports QNN (for C# code; they are working on Python support).
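A minimal llama-server sketch for the long-context point above (the file name and context size are just examples; `-c` caps the context so the KV cache stays manageable):

```shell
# serve a local GGUF with a capped context; exposes an OpenAI-compatible API plus a small web UI
./llama-server -m llama3.1-8b-Q4_0_4_8.gguf -c 8192 --host 0.0.0.0 --port 8080
```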

Hope this helps/clarifies a little (and best regards from Vienna)

@arudaev commented on GitHub (Sep 9, 2024):

> > I'm new to using llama.cpp and related tools. After testing my device, I'm satisfied with its performance, but Ollama is running very slowly. My goal is to set up a Docker container that leverages WSL2 to run Llama 3 (7B) efficiently and maximize performance. However, the available resources are overwhelming and unclear. I need a streamlined solution to run an Ollama container with optimal speed and accuracy.
>
> 1. There is currently no GPU/NPU support in ollama (or the llama.cpp code it's based on) for the Snapdragon X - so forget about GPU/NPU Geekbench results, they don't matter. The underlying llama.cpp code does not currently work with the Qualcomm Vulkan GPU driver for Windows (in WSL2 the Vulkan driver works, but it is a very slow CPU emulation). There is some work being done in llama.cpp to try to support the QNN code, but it's quite far from being workable.
>
>    The Snapdragon X does, however, support accelerated execution via the CPU, but this is in its very early stages with the core llama.cpp code ollama uses. This CPU acceleration is mainly for prompt-processing (2-3x faster); LLM token-generation is more dependent on memory bandwidth and not improved much. And it currently requires a special model format (quantized as Q4_0_4_8 instead of Q4_0).
>
> 2. ollama on Windows (not WSL2) is currently in preview. You need to compile it manually if you want it to run natively on Windows for ARM - I would not recommend this for beginners. The installation you use might run emulated as x64 code.
> 3. WSL2 needs to be configured accordingly (file `.wslconfig` in your Windows user directory) in order to use the right amount of RAM (setting: memory, default is only 50%) and all CPUs (setting: processors, I suggest 12). You need to make sure that you use an aarch64/arm64 ollama Linux image. But ollama builds and runs really well on WSL2 Linux. Running ollama in a correctly configured WSL2 is as fast as (maybe even faster than) running natively. There are performance penalties if you don't store your files natively in WSL2/Linux. WSL2 currently does NOT support GPU/NPU acceleration for the Snapdragon X, but it does support the CPU acceleration of the llama.cpp code.
> 4. Docker - I have no experience with running ollama on WSL2-based Docker on Windows for ARM.
>
> A few personal notes on the Surface Pro 11 and ollama/llama.cpp:
>
> - ollama is a great shell for reducing the complexity of the base llama.cpp code and I really like it!!! But the innovations in GPU/NPU acceleration happen first in llama.cpp. I use the llama.cpp `llama-server` instead of ollama when trying out new things. But you need to manually download your model and start llama-server with the right parameters. As a benefit, llama-server offers not only OpenAI-compatible APIs but also a playground-like webserver. When using llama3.1 8B with its long context, don't forget to limit the context size, otherwise your RAM use "explodes" (because of the KV cache required for the default 128k context).
> - The thermals of the Surface Pro 11 tablet force the Snapdragon X to throttle quite soon if you max out all the CPUs while running your LLMs. Watch your CPU utilization (there is currently no CPU-temperature monitor for the Surfaces).
> - If you want GPU acceleration for your Surface, you might try WebGL-based AI (e.g. in Chrome).
> - NPU-accelerated AI on the Surface currently all seems to be Qualcomm QNN based. Microsoft's Semantic Kernel supports QNN (for C# code; they are working on Python support).
>
> Hope this helps/clarifies a little (and best regards from Vienna)

Thank you so much for your detailed response and insights! You've clarified a lot of points that were overwhelming and confusing. Even with the current limitations of the SP11 device, I hope to still develop a container that works on WSL2 on Windows for ARM.

I'll take a closer look at using llama.cpp with the llama-server as you suggested, especially for new experiments. I'll also keep in mind the context-size limits to avoid excessive RAM usage with Llama 3.1. It's a great reminder to check CPU utilization to prevent thermal throttling on the Surface Pro 11.

Your advice has given me a lot of direction, and I really appreciate your time and insights!

@twlswan commented on GitHub (Sep 11, 2024):

> 1. There is some work being done in llama.cpp to try to support the QNN code, but it's quite far from being workable.

I'm completely out of the loop, but hasn't that PR (#6869) been closed by llama.cpp's maintainer after the PR author complained without tact? (I do wonder how much of it was even his intention, since his English was clearly A2 level at best...)

That said, thanks a ton for sharing; it looks like the X Elite (especially the SKUs with 12 cores) is actually pretty good.

@AndreasKunar commented on GitHub (Sep 12, 2024):

> … but hasn't that PR (#6869) been closed by llama.cpp's maintainer after the PR author complained without tact? …

Someone in the thread forked his own version and seems to still be working on it, not the PR originator. I'm currently swamped with other work, but will try to get into it deeper in October.

@jonathanarava commented on GitHub (Oct 28, 2024):

Can we please bump this ticket up somehow? Or at least, which link can I follow to track the development on this?

I am currently using Ollama 0.3.14 on the Snapdragon X Elite. It is really good at running the Llama 3.1 8B model (even if it is offloaded to the CPU and not using the GPU). But obviously I would like it to use the GPU (taking at face value the comment that running on the NPU results in lower tokens/s).

Thank you

@AndreasKunar commented on GitHub (Oct 28, 2024):

> Can we please bump this ticket up somehow? Or at least, which link can I follow to track the development on this?

There seems to be no development being done which could be used for ollama…

> I am currently using Ollama 0.3.14 on the Snapdragon X Elite. It is really good at running the Llama 3.1 8B model (even if it is offloaded to the CPU and not using the GPU). But obviously I would like it to use the GPU (taking at face value the comment that running on the NPU results in lower tokens/s).

I don't think that running it on the GPU would be faster than, e.g., the Q4_0_4_4 quantization running on the CPU. I also have an M2 with a 10-core GPU, and running Q4_0 on its GPU has approximately the same tokens/s performance as my Snapdragon X Elite on the CPU with Q4_0_4_4. The Snapdragon's Adreno GPU has less horsepower than the M2's. So there is little benefit to be had for a lot of work, and for very few users - running the GPU on the Snapdragon X via Vulkan on Windows / llama.cpp does not work because of a driver issue. As for supporting the NPU, even ONNX/QNN cannot use the NPU for Llama models - apparently it's too complicated, or maybe I was just too stupid to get it to work.

So, net, my recommendation is: don't expect the Snapdragon X's GPU/NPU to get full LLM support in llama.cpp inference anytime soon. The NPU will likely only be usable for very small, dedicated SLMs inside special apps developed with QNN. Everything else will run (quite fast) on the CPU. Also remember that LLM inference is largely bound by memory bandwidth, and not so much by compute horsepower, so there is not much to be gained from developing the special GPU code.

@jonathanarava commented on GitHub (Oct 28, 2024):

Thank you for your swift response. Your explanation makes sense.

I agree that memory bandwidth is a critical factor. (rhetorical question) Wouldn't it be more efficient to load the entire model onto the GPU? This approach could potentially minimize the CPU cycles required for data transfer between RAM and the CPU, leading to improved inference times. I understand that the CPU may not be the bottleneck in this scenario, but overall, it would be interesting to see the full capability of using CPU, RAM and the NPU on low spec devices.

Thanks again!

@AndreasKunar commented on GitHub (Oct 28, 2024):

> I agree that memory bandwidth is a critical factor. (rhetorical question) Wouldn't it be more efficient to load the entire model onto the GPU? This approach could potentially minimize the CPU cycles required for data transfer between RAM and the CPU, leading to improved inference times. I understand that the CPU may not be the bottleneck in this scenario, but overall, it would be interesting to see the full capability of using CPU, RAM and the NPU on low spec devices.

LLMs generate each new token by computing the entire graph of their artificial neural network again and again. So they have to pump the entire set of billions of parameters, plus the KV caches (something like the AI's short-term memory; it grows to GBs with large contexts like Llama 3.1's 128k), out of unified RAM (these SoCs don't have dedicated RAM for the CPU/GPU/NPU) into the quite tiny on-chip caches for processing the computations. This has to happen anew for each token. The processors can do a lot of computations at the same time (e.g. my M2 Mac's GPU has over 1000 units for simultaneous computation / ALUs), so the GPUs idle a lot, waiting for their data from memory. This is why modern SoCs have a RAM bandwidth of 100-130 GByte/s - and yet this is still the bottleneck; even the Snapdragon X CPUs have enough simultaneous matrix-processing units to handle it. The M2 Pro doubles the bandwidth to 200, the M2 Max has 400, the M2 Ultra 800, and the NVIDIA 4090 over 1000 - that's why they are faster.

Only when the LLM processes the prompt initially, and during training/fine-tuning, can it batch the processing of multiple tokens at once, and then GPUs can totally shine with their horsepower. This is why training is done on NVIDIA, and why Macs with 96 or 192 GB RAM are perfect for "cheap" inference of quite large models (NVIDIA RAM is crazy expensive). And a lot of development is done for these, e.g. ollama, llama.cpp, …
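As a rough back-of-envelope illustration (assuming roughly 135 GB/s memory bandwidth for the Snapdragon X Elite and the 4.33 GiB Q4_0 8B weights from the tables above): every generated token has to stream the whole model from RAM once, so the hard ceiling is about 135 / 4.33 ≈ 31 tokens/s before accounting for the KV cache and compute overhead - which is why measured token generation sits well below that, regardless of how much extra compute a GPU or NPU could add.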

@jonathanarava commented on GitHub (Oct 31, 2024):

Thank you for the detailed explanation, Andreas! Your insights into the limitations of memory bandwidth and how LLMs process tokens have really helped clarify things. It makes sense that loading the entire model onto the GPU could potentially minimize CPU overhead, but as you pointed out, the underlying architecture of these SoCs complicates that.

Thanks again for your help!

@behroozbc commented on GitHub (May 9, 2025):

Is any update available for this issue?

@AndreasKunar commented on GitHub (May 11, 2025):

> Is any update available for this issue?

Here is the current status of Snapdragon X GPU/NPU support to my knowledge:

- **GPU**: llama.cpp has an [OpenCL backend](https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/OPENCL.md) (a build sketch follows this list). But this is **still slower** than running on the very fast Snapdragon X CPUs, so **it currently makes no sense to use it**.
- **NPU**: It's **still not implemented by llama.cpp**. For details/progress see [this issue](https://github.com/ggml-org/llama.cpp/issues/7772). Microsoft's [AI Toolkit for VSCode](https://marketplace.visualstudio.com/items?itemName=ms-windows-ai-studio.windows-ai-studio) lets you play with some NPU models (see there for new developments). But last time I tested it, it was slow vs. the Snapdragon X CPUs' horsepower.
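A minimal sketch of trying that OpenCL backend (based on the linked OPENCL.md; treat that document as authoritative for the exact flags, which may change):

```shell
# build llama.cpp with the OpenCL backend enabled
cmake -B build -DGGML_OPENCL=ON
cmake --build build --config Release
# offload layers to the Adreno GPU with the usual -ngl option
./build/bin/llama-cli -m model-Q4_0.gguf -ngl 99 -p "hello"
```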

I could not find out whether using the GPU/NPU would yield more power efficiency while still having good performance. My problem with the Snapdragon X is that I could not get any power-consumption metrics for its SoC.

I tried to find out whether GPU/NPU/... use can help with power consumption, both for prompt-processing/prefill (compute-bound) and token-generation/decode (mainly memory-bandwidth bound) - see [my medium.com article](https://medium.com/@andreask_75652/power-consumption-of-our-ai-use-f2b1f9bce97b), where I analyzed my Apple M-series SoC and the very power-efficient NVIDIA Jetson Orin SoC (but did not have any way to measure SoC power draw on my Snapdragon X machine).

@chraac commented on GitHub (May 11, 2025):

> I tried to find out whether GPU/NPU/... use can help with power consumption, both for prompt-processing/prefill (compute-bound) and token-generation/decode (mainly memory-bandwidth bound) - see my medium.com article, where I analyzed my Apple M-series SoC and the very power-efficient NVIDIA Jetson Orin SoC (but did not have any way to measure SoC power draw on my Snapdragon X machine).

It looks like [hwinfo64](https://www.hwinfo.com/forum/threads/arm64-windows-compatible-app.9506/) can now run on Windows ARM laptops; I don't know whether there are some power metrics available.

![Image](https://github.com/user-attachments/assets/cef6a4e2-47f9-46da-aa37-1d35df0cc428)

@AndreasKunar commented on GitHub (May 11, 2025):

> > I tried to find out whether GPU/NPU/... use can help with power consumption, both for prompt-processing/prefill (compute-bound) and token-generation/decode (mainly memory-bandwidth bound) - see my medium.com article, where I analyzed my Apple M-series SoC and the very power-efficient NVIDIA Jetson Orin SoC (but did not have any way to measure SoC power draw on my Snapdragon X machine).
>
> It looks like hwinfo64 can now run on Windows ARM laptops; I don't know whether there are some power metrics available.

Thanks. I'm using HWMonitor on the Surface, and it provides a lot of sensor information on arm64, but no power-consumption details yet. HWMonitor is adding new sensors from time to time. I have not tried hwinfo64 yet. Windows' `powercfg /SYSTEMPOWERREPORT` displays some "Energy Meter" data for the CPU and GPU, but I could not figure out how to use this, or whether the NPU provides an energy-meter input.

Overall it's too complicated / too much effort for me on the Snapdragon X. On Apple/NVIDIA hardware it's easier: on Macs there is e.g. github: tlkh/asitop or exelban/stats; on the NVIDIA Jetson there is rbonghi/jetson_stats. Both provide SoC power-consumption values which I could use together with running llama.cpp performance measurements.

@samirgaire10 commented on GitHub (Jun 7, 2025):

Please, please, please - we urgently need better and faster support for Snapdragon's NPU and GPU. Please make this a top priority!

@sXe79 commented on GitHub (Dec 18, 2025):

Hello, I finally decided to try a local LLM, only to find out that the NPU of my Surface Laptop 7 X Elite is 0% used. Meh :/

@rpascalsdl commented on GitHub (Dec 19, 2025):

@sXe79 to be fair, the state of NPUs is laughable at best. From my testing on a Snapdragon, you're better off running LLMs on CPU.

If you really want to give NPUs a try you have the option of Nexa SDK, or using VS Code with AI Toolkit extension. Just make sure you use models that are marked with Qualcomm NPU support.

Last note: always run LLMs when plugged in. For some reason, my laptop at least throttles down so much on battery that it's completely unusable.

@xgdgsc commented on GitHub (Dec 19, 2025):

https://github.com/microsoft/Foundry-Local already works for the X Elite NPU. The service isn't very stable - I get crashes - but you can try it out.

@lyleschemmerling commented on GitHub (Dec 19, 2025):

https://anythingllm.com/ can also run on the NPU, which is interesting because their main backend is ollama. But it is not impressive; the CPU usually wins in an apples-to-apples comparison.

I have gotten GPU acceleration to work on the Adreno. It was a huge pain for negligible performance gain, and the drivers are not stable. People are still working on it, but I doubt the juice will be worth the squeeze.

The Snapdragon is pretty well optimized as far as CPUs go. Stick with that.

@BootsSiR commented on GitHub (Dec 19, 2025):

> @sXe79 to be fair, the state of NPUs is laughable at best. From my testing on a Snapdragon, you're better off running LLMs on CPU.
>
> If you really want to give NPUs a try you have the option of Nexa SDK, or using VS Code with the AI Toolkit extension. Just make sure you use models that are marked with Qualcomm NPU support.
>
> Last note: always run LLMs when plugged in. For some reason, my laptop at least throttles down so much on battery that it's completely unusable.

I tested LLMs with both CPU and NPU on my Snapdragon device, and the CPU crushed the NPU in terms of performance.

@rpascalsdl commented on GitHub (Dec 19, 2025):

Since the subject of the GPU was raised: GPT-OSS 20B runs at a very acceptable up to 20 tokens per second using Nexa SDK - compared to about 3-4 on CPU. But I do suspect they're doing some dark magic to make it happen. Maybe that's why it's not in their public list of models, but if you go through their X posts, you can figure out how to run it.

@lyleschemmerling commented on GitHub (Dec 19, 2025):

Interesting. I might give it another shot this weekend. If I succeed I'll try to update this thread.

@rjtokenring commented on GitHub (Mar 6, 2026):

WIP: https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/snapdragon/README.md

@behroozbc commented on GitHub (Apr 9, 2026):

Is any update available for this issue?

@arudaev commented on GitHub (Apr 9, 2026):

> Is any update available for this issue?

Good question. I think the simple answer is that the NPU isn't made to run LLMs or GGUF models; it's made to run built-in AI features at the same speed as before, but with less energy consumption and without impacting the CPU/GPU.


Reference: github-starred/ollama#3357