[GH-ISSUE #5360] Support for Snapdragon X Elite NPU & GPU #3357

Open
opened 2026-04-12 13:58:21 -05:00 by GiteaMirror · 42 comments

Originally created by @flyfox666 on GitHub (Jun 28, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5360

Originally assigned to: @dhiltgen on GitHub.

Hi all.

I just got a Microsoft Surface Laptop 7, an AI PC with a Snapdragon X Elite, an NPU, and an Adreno GPU. It is an ARM-based system.

But I found that the NPU is not used when running Ollama.

Will Ollama support the NPU and GPU?

GiteaMirror added the feature request and windows labels 2026-04-12 13:58:22 -05:00

@tholum commented on GitHub (Jun 28, 2024):

More than support for the GPU, I think the Hexagon NPU would be better to support.

@flyfox666 commented on GitHub (Jun 29, 2024):

> More than support for the GPU, I think the Hexagon NPU would be better to support.

Yep, the NPU is better.

@leejw51 commented on GitHub (Jun 30, 2024):

On a Samsung Galaxy Book4 with a Snapdragon X Elite, Ollama is too slow.

@Srafington commented on GitHub (Jun 30, 2024):

Those wanting a bit more oomph before this issue is addressed should run Ollama via WSL, as there are native ARM binaries for Linux. They still won't use the NPU or GPU, but it is still much faster than running the Windows x86-64 binaries through emulation. SLMs like Phi are very speedy when run this way.
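For reference, a minimal sketch of the WSL route (assuming an Ubuntu arm64 distro is already set up; the script is Ollama's standard Linux installer, and the model name is just an example):

```shell
# inside the WSL2 Ubuntu shell - installs the native arm64 Linux build
curl -fsSL https://ollama.com/install.sh | sh
# pull and chat with a small model such as Phi-3
ollama run phi3
```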

@dhiltgen commented on GitHub (Jul 3, 2024):

We don't yet have an official arm windows binary, but you should be able to build from source until we do.

@danilofalcao commented on GitHub (Jul 7, 2024):

I would be available to test any developments on that matter if necessary.

@dhiltgen commented on GitHub (Jul 22, 2024):

Once #5712 merges we'll have official support for running in CPU mode on the Snapdragon systems, but additional PR(s) will need to merge upstream in llama.cpp before NPU/GPU support can be enabled. On my test system, I'm seeing ~18-20TPS on llama3 on the CPU, so it's usable. My understanding is the NPU may actually be slightly slower, although much more power efficient.

@AndreasKunar commented on GitHub (Jul 23, 2024):

Please note that recent llama.cpp innovations with Q4_0_4_8 quantization on Snapdragon X CPUs give nearly the same performance as (or more than) Q4_0 on base Apple Silicon with its GPU; see [accelerating Q4_0 CPU performance 2-2.5x](https://github.com/ggerganov/llama.cpp/pull/5780).

I also tried to get llama.cpp GPU acceleration to work on Snapdragon X via Vulkan, but it's not working (yet) - see [llama.cpp issue #8455](https://github.com/ggerganov/llama.cpp/issues/8455).

@AndreasKunar commented on GitHub (Jul 23, 2024):

E.g., here is the performance of a Snapdragon X Plus (CPU-only, but Q4_0_4_8-optimized) vs. a 10-core M2 (CPU and GPU) for the new Llama3-8B Groq-Tool-Use optimized local LLM. Yes, the Plus is still slower than the M2, but not by much, and the Elite is probably faster.

Snapdragon X Plus, Surface 11 Pro 16GB, Windows 11 24H2, MSVC+clang, llama.cpp build: 081fe431 (3441):

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 46.02 ± 0.37 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 12.58 ± 2.63 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 150.04 ± 10.17 |
| llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 16.81 ± 3.46 |

M2 10GPU, MacBook Air 24GB, MacOS 14.5, llama.cpp build: 081fe431 (3441):

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 0 | pp512 | 58.54 ± 0.17 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 0 | tg128 | 12.97 ± 0.08 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | pp512 | 178.03 ± 0.12 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | tg128 | 19.20 ± 0.11 |

P.S.: llama.cpp Q4_0_4_8 conversion is done via `./llama-quantize --allow-requantize <q4_0 model-name> <q4_0_4_8 name> Q4_0_4_8`
P.P.S.: token-generation (tg) is largely memory-bandwidth bound, while prompt-processing (pp) is compute-horsepower dependent.
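The tables above are llama-bench output. A minimal sketch to reproduce a similar run (file names are placeholders; llama-quantize and llama-bench are built as part of llama.cpp):

```shell
# requantize an existing Q4_0 GGUF into the CPU-optimized Q4_0_4_8 layout (placeholder file names)
./llama-quantize --allow-requantize llama3-8b-Q4_0.gguf llama3-8b-Q4_0_4_8.gguf Q4_0_4_8
# benchmark prompt processing (pp512) and token generation (tg128) on 10 threads
./llama-bench -m llama3-8b-Q4_0_4_8.gguf -t 10 -p 512 -n 128
```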

@flyfox666 commented on GitHub (Jul 24, 2024):

> Once #5712 merges we'll have official support for running in CPU mode on the Snapdragon systems, but additional PR(s) will need to merge upstream in llama.cpp before NPU/GPU support can be enabled. On my test system, I'm seeing ~18-20TPS on llama3 on the CPU, so it's usable. My understanding is the NPU may actually be slightly slower, although much more power efficient.

Hi, thanks for the reply. Looking forward to it.

@flyfox666 commented on GitHub (Jul 24, 2024):

> E.g., here is the performance of a Snapdragon X Plus (CPU-only, but Q4_0_4_8-optimized) vs. a 10-core M2 (CPU and GPU) for the new Llama3-8B Groq-Tool-Use optimized local LLM. Yes, the Plus is still slower than the M2, but not by much, and the Elite is probably faster.
>
> Snapdragon X Plus, Surface 11 Pro 16GB, Windows 11 24H2, MSVC+clang, llama.cpp build: 081fe431 (3441):
>
> | model | size | params | backend | threads | test | t/s |
> | --- | ---: | ---: | --- | ---: | ---: | ---: |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 46.02 ± 0.37 |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 12.58 ± 2.63 |
> | llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 150.04 ± 10.17 |
> | llama 8B Q4_0_4_8 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 16.81 ± 3.46 |
>
> M2 10GPU, MacBook Air 24GB, MacOS 14.5, llama.cpp build: 081fe431 (3441):
>
> | model | size | params | backend | ngl | test | t/s |
> | --- | ---: | ---: | --- | --: | ---: | ---: |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 0 | pp512 | 58.54 ± 0.17 |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 0 | tg128 | 12.97 ± 0.08 |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | pp512 | 178.03 ± 0.12 |
> | llama 8B Q4_0 | 4.33 GiB | 8.03 B | Metal | 99 | tg128 | 19.20 ± 0.11 |
>
> P.S.: llama.cpp Q4_0_4_8 conversion is done via `./llama-quantize --allow-requantize <q4_0 model-name> <q4_0_4_8 name> Q4_0_4_8`
> P.P.S.: token-generation (tg) is largely memory-bandwidth bound, while prompt-processing (pp) is compute-horsepower dependent.

thanks a lot

@Hassansaleh22 commented on GitHub (Jul 27, 2024):

Thanks all,
Do you have an estimated timeline for when the necessary pull requests (#5712 and others for NPU/GPU support) will be merged? Also, will we need to uninstall the current version before updating to get native ARM working without emulation?

@SebastianGode commented on GitHub (Aug 1, 2024):

@AndreasKunar Importing a Q4_0_4_8 model built under WSL into native ARM Ollama doesn't seem to work.
Ollama doesn't support Q4_0_4_8 yet, correct?

@AndreasKunar commented on GitHub (Aug 1, 2024):

> @AndreasKunar Importing a Q4_0_4_8 model built under WSL into native ARM Ollama doesn't seem to work. Ollama doesn't support Q4_0_4_8 yet, correct?

Q4_0_4_8 requires an arm64 compile of llama.cpp (Linux and Windows). And for Windows it requires a build with clang, since MSVC does not support the required inline asm for arm64. See the [llama.cpp build instructions](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md). I don't know how ollama builds, and whether the llama.cpp component's build process correctly builds for Windows on ARM - I have not tested PR #5712 yet.

Building for Snapdragon X in WSL2 with e.g. Ubuntu is commonly much easier, and it's not slower than in native Windows. Just don't forget to allocate CPUs and memory to WSL2 in `%USERPROFILE%\.wslconfig`:

```shell
[wsl2]
processors=10
memory=12GB
```
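The new limits only take effect once the WSL VM restarts; a minimal way to force that (assuming nothing else is running in WSL):

```shell
# run from Windows (PowerShell or cmd); .wslconfig is re-read on the next WSL start
wsl --shutdown
```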

I will try and build ollama in WSL2 on my Surface and try and import+use a Q4_0_4_8 model.

@AndreasKunar commented on GitHub (Aug 1, 2024):

@SebastianGode - I tried to build ollama on WSL2/Ubuntu24.04 on my Surface 11 Pro and test it with Q4_0_4_8.
Ollama+llama.cpp builds, imports my local llama-2 Q4_0, and runs it.
But when I try and import my local llama-2 Q4_0_4_8 model (which runs with llama.cpp), it gives an "Error: invalid file magic" from its ggml.go module (at line#311), which does not seem to understand the new Q4_0_4_4 and Q4_0_4_8 formats.

Should we raise an issue?

@SebastianGode commented on GitHub (Aug 1, 2024):

@AndreasKunar Yes, that is the exact same issue for me. Good that you could verify that and that I wasn't too dumb to use Ollama.

Please go ahead and open an issue. I assume this shouldn't be that hard to fix, likely just some dependency which would need to be updated (but that's just my assumption).

@Berowne commented on GitHub (Aug 28, 2024):

I'm keen to stand on the shoulders of giants. I've subscribed to this thread! Keep up the good work.

@arudaev commented on GitHub (Sep 9, 2024):

I'm new to using llama.cpp and related tools. After testing my device, I'm satisfied with its performance, but Ollama is running very slowly. My goal is to set up a Docker container that leverages WSL2 to run Llama 3 (7B) efficiently and maximize performance. However, the available resources are overwhelming and unclear. I need a streamlined solution to run an Ollama container with optimal speed and accuracy.

- [Microsoft Surface Pro, 11th 32GB: GPU Geekbench](https://browser.geekbench.com/v6/compute/2729091)
- [Microsoft Surface Pro, 11th 32GB: CPU Geekbench](https://browser.geekbench.com/v6/cpu/7688747)
- [Microsoft Surface Pro, 11th 32GB: NPU Geekbench AI](https://browser.geekbench.com/ai/v1/49797)

@AndreasKunar commented on GitHub (Sep 9, 2024):

> I'm new to using llama.cpp and related tools. After testing my device, I'm satisfied with its performance, but Ollama is running very slowly. My goal is to set up a Docker container that leverages WSL2 to run Llama 3 (7B) efficiently and maximize performance. However, the available resources are overwhelming and unclear. I need a streamlined solution to run an Ollama container with optimal speed and accuracy.

1. There is currently no GPU/NPU support in ollama (or the llama.cpp code it's based on) for the Snapdragon X - so forget about GPU/NPU Geekbench results, they don't matter. The underlying llama.cpp code does not currently work with the Qualcomm Vulkan GPU driver for Windows (in WSL2 the Vulkan driver works, but it is a very slow CPU emulation). There is some work being done in llama.cpp to try to support the QNN code, but it's quite far from being workable.

   The Snapdragon X does, however, support accelerated execution via the CPU, but this is in its very early stages with the core llama.cpp code ollama uses. This CPU acceleration is mainly for prompt-processing (2-3x faster); LLM token-generation is more dependent on memory bandwidth and not improved much. And it currently requires a special model format (quantized as Q4_0_4_8 instead of Q4_0).

2. ollama on Windows (not WSL2) is currently in preview. You need to compile it manually if you want it to run natively on Windows for ARM - I would not recommend this for beginners. The installation you use might run emulated as x64 code.

3. WSL2 needs to be configured accordingly (file `.wslconfig` in your Windows user directory) in order to use the right amount of RAM (setting: memory, default is only 50%) and all CPUs (setting: processors, I suggest 12). You need to make sure that you use an aarch64/arm64 ollama Linux image. But ollama builds and runs really well on WSL2 Linux. Running ollama in a correctly configured WSL2 is as fast as (maybe even faster than) running natively. There are performance penalties if you don't store your files natively in WSL2/Linux. WSL2 currently does NOT support GPU/NPU acceleration for the Snapdragon X, but it does support the CPU acceleration of the llama.cpp code.

4. Docker - I have no experience with running ollama on WSL2-based Docker on Windows for ARM (see the sketch after this list).
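For anyone who wants to try the Docker route anyway, a minimal sketch (untested on Windows on ARM; it assumes Docker Desktop's WSL2 backend pulls the arm64 variant of the official ollama/ollama image):

```shell
# start the ollama container, keeping models in a named volume and exposing the API port
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# pull and run a model through the container
docker exec -it ollama ollama run llama3
```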

A few personal notes on the Surface Pro 11 and ollama/llama.cpp:

- ollama is a great shell for reducing the complexity of the base llama.cpp code and I really like it!!! But the innovations in GPU/NPU acceleration happen first in llama.cpp. I use the llama.cpp `llama-server` instead of ollama when trying out new things (see the sketch after this list). But you need to manually download your model and start llama-server with the right parameters. As a benefit, llama-server offers not only OpenAI-compatible APIs but also a playground-like webserver. When using llama3.1 8B with its long context, don't forget to limit the context size, otherwise your RAM use "explodes" (because of the KV cache required for the default 128k context).
- The thermals of the Surface Pro 11 tablet force the Snapdragon X to throttle quite soon if you max out all the CPUs while running your LLMs. Watch your CPU utilization (there is currently no CPU-temperature monitor for the Surfaces).
- If you want GPU acceleration for your Surface, you might try WebGL-based AI (e.g. in Chrome).
- NPU-accelerated AI on the Surface currently all seems to be Qualcomm QNN based. Microsoft's Semantic Kernel supports QNN (for C# code; they are working on Python support).
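A minimal llama-server sketch for the long-context point above (the file name and context size are just examples; `-c` caps the context so the KV cache stays manageable):

```shell
# serve a local GGUF with a capped context; exposes an OpenAI-compatible API plus a small web UI
./llama-server -m llama3.1-8b-Q4_0_4_8.gguf -c 8192 --host 0.0.0.0 --port 8080
```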

Hope this helps/clarifies a little (and best regards from Vienna)

@arudaev commented on GitHub (Sep 9, 2024):

> > I'm new to using llama.cpp and related tools. After testing my device, I'm satisfied with its performance, but Ollama is running very slowly. My goal is to set up a Docker container that leverages WSL2 to run Llama 3 (7B) efficiently and maximize performance. However, the available resources are overwhelming and unclear. I need a streamlined solution to run an Ollama container with optimal speed and accuracy.
>
> 1. There is currently no GPU/NPU support in ollama (or the llama.cpp code it's based on) for the Snapdragon X - so forget about GPU/NPU Geekbench results, they don't matter. The underlying llama.cpp code does not currently work with the Qualcomm Vulkan GPU driver for Windows (in WSL2 the Vulkan driver works, but it is a very slow CPU emulation). There is some work being done in llama.cpp to try to support the QNN code, but it's quite far from being workable.
>
>    The Snapdragon X does, however, support accelerated execution via the CPU, but this is in its very early stages with the core llama.cpp code ollama uses. This CPU acceleration is mainly for prompt-processing (2-3x faster); LLM token-generation is more dependent on memory bandwidth and not improved much. And it currently requires a special model format (quantized as Q4_0_4_8 instead of Q4_0).
>
> 2. ollama on Windows (not WSL2) is currently in preview. You need to compile it manually if you want it to run natively on Windows for ARM - I would not recommend this for beginners. The installation you use might run emulated as x64 code.
> 3. WSL2 needs to be configured accordingly (file `.wslconfig` in your Windows user directory) in order to use the right amount of RAM (setting: memory, default is only 50%) and all CPUs (setting: processors, I suggest 12). You need to make sure that you use an aarch64/arm64 ollama Linux image. But ollama builds and runs really well on WSL2 Linux. Running ollama in a correctly configured WSL2 is as fast as (maybe even faster than) running natively. There are performance penalties if you don't store your files natively in WSL2/Linux. WSL2 currently does NOT support GPU/NPU acceleration for the Snapdragon X, but it does support the CPU acceleration of the llama.cpp code.
> 4. Docker - I have no experience with running ollama on WSL2-based Docker on Windows for ARM.
>
> A few personal notes on the Surface Pro 11 and ollama/llama.cpp:
>
> - ollama is a great shell for reducing the complexity of the base llama.cpp code and I really like it!!! But the innovations in GPU/NPU acceleration happen first in llama.cpp. I use the llama.cpp `llama-server` instead of ollama when trying out new things. But you need to manually download your model and start llama-server with the right parameters. As a benefit, llama-server offers not only OpenAI-compatible APIs but also a playground-like webserver. When using llama3.1 8B with its long context, don't forget to limit the context size, otherwise your RAM use "explodes" (because of the KV cache required for the default 128k context).
> - The thermals of the Surface Pro 11 tablet force the Snapdragon X to throttle quite soon if you max out all the CPUs while running your LLMs. Watch your CPU utilization (there is currently no CPU-temperature monitor for the Surfaces).
> - If you want GPU acceleration for your Surface, you might try WebGL-based AI (e.g. in Chrome).
> - NPU-accelerated AI on the Surface currently all seems to be Qualcomm QNN based. Microsoft's Semantic Kernel supports QNN (for C# code; they are working on Python support).
>
> Hope this helps/clarifies a little (and best regards from Vienna)

Thank you so much for your detailed response and insights! You've clarified a lot of points that were overwhelming and confusing. Even with the current limitations of the SP11 device, I hope to still develop a container that works on WSL2 on Windows for ARM.

I'll take a closer look at using llama.cpp with the llama-server as you suggested, especially for new experiments. I'll also keep in mind the context-size limits to avoid excessive RAM usage with Llama 3.1. It's a great reminder to check CPU utilization to prevent thermal throttling on the Surface Pro 11.

Your advice has given me a lot of direction, and I really appreciate your time and insights!

@twlswan commented on GitHub (Sep 11, 2024):

> 1. There is some work being done in llama.cpp to try to support the QNN code, but it's quite far from being workable.

I'm completely out of the loop, but hasn't that PR (#6869) been closed by llama.cpp's maintainer after the PR author complained without tact? (I do wonder how much of it was even his intention, since his English was clearly A2 level at best...)

That said, thanks a ton for sharing; it looks like the X Elite (especially the SKUs with 12 cores) is actually pretty good.

@AndreasKunar commented on GitHub (Sep 12, 2024):

> … but hasn't that PR (#6869) been closed by llama.cpp's maintainer after the PR author complained without tact? …

Someone in the thread forked his own version and seems to still be working on it, not the PR originator. I'm currently swamped with other work, but will try to get into it deeper in October.

@jonathanarava commented on GitHub (Oct 28, 2024):

Can we please bump this ticket up somehow? Or at least, which link can I follow to track the development on this?

I am currently using Ollama 0.3.14 on the Snapdragon X Elite. It is really good at running the Llama 3.1 8B model (even if it is offloaded to the CPU and not using the GPU). But obviously I would like it to use the GPU (taking at face value the comment that running on the NPU results in lower tokens/s).

Thank you

@AndreasKunar commented on GitHub (Oct 28, 2024):

> Can we please bump this ticket up somehow? Or at least, which link can I follow to track the development on this?

There seems to be no development being done which could be used for ollama…

> I am currently using Ollama 0.3.14 on the Snapdragon X Elite. It is really good at running the Llama 3.1 8B model (even if it is offloaded to the CPU and not using the GPU). But obviously I would like it to use the GPU (taking at face value the comment that running on the NPU results in lower tokens/s).

I don't think that running it on the GPU would be faster than, e.g., the Q4_0_4_4 quantization running on the CPU. I also have an M2 with a 10-core GPU, and running Q4_0 on its GPU has approximately the same tokens/s performance as my Snapdragon X Elite on the CPU with Q4_0_4_4. The Snapdragon's Adreno GPU has less horsepower than the M2's. So there is little benefit to be had for a lot of work, and for very few users - running the GPU on the Snapdragon X via Vulkan on Windows / llama.cpp does not work because of a driver issue. As for supporting the NPU, even ONNX/QNN cannot use the NPU for Llama models - apparently it's too complicated, or maybe I was just too stupid to get it to work.

So, net, my recommendation is: don't expect the Snapdragon X's GPU/NPU to get full LLM support in llama.cpp inference anytime soon. The NPU will likely only be usable for very small, dedicated SLMs inside special apps developed with QNN. Everything else will run (quite fast) on the CPU. Also remember that LLM inference is largely bound by memory bandwidth, and not so much by compute horsepower, so there is not much to be gained from developing the special GPU code.

@jonathanarava commented on GitHub (Oct 28, 2024):

Thank you for your swift response. Your explanation makes sense.

I agree that memory bandwidth is a critical factor. (rhetorical question) Wouldn't it be more efficient to load the entire model onto the GPU? This approach could potentially minimize the CPU cycles required for data transfer between RAM and the CPU, leading to improved inference times. I understand that the CPU may not be the bottleneck in this scenario, but overall, it would be interesting to see the full capability of using CPU, RAM and the NPU on low spec devices.

Thanks again!

@AndreasKunar commented on GitHub (Oct 28, 2024):

> I agree that memory bandwidth is a critical factor. (rhetorical question) Wouldn't it be more efficient to load the entire model onto the GPU? This approach could potentially minimize the CPU cycles required for data transfer between RAM and the CPU, leading to improved inference times. I understand that the CPU may not be the bottleneck in this scenario, but overall, it would be interesting to see the full capability of using CPU, RAM and the NPU on low spec devices.

LLMs generate each new token by computing the entire graph of their artificial neural network again and again. So they have to pump the entire set of billions of parameters, plus the KV caches (something like the AI's short-term memory; it grows to GBs with large contexts like Llama 3.1's 128k), out of unified RAM (these SoCs don't have dedicated RAM for the CPU/GPU/NPU) into the quite tiny on-chip caches for processing the computations. This has to happen anew for each token. The processors can do a lot of computations at the same time (e.g. my M2 Mac's GPU has over 1000 units for simultaneous computation / ALUs), so the GPUs idle a lot, waiting for their data from memory. This is why modern SoCs have a RAM bandwidth of 100-130 GByte/s - and yet this is still the bottleneck; even the Snapdragon X CPUs have enough simultaneous matrix-processing units to handle it. The M2 Pro doubles the bandwidth to 200, the M2 Max has 400, the M2 Ultra 800, and the NVIDIA 4090 over 1000 - that's why they are faster.

Only when the LLM processes the prompt initially, and during training/fine-tuning, can it batch the processing of multiple tokens at once, and then GPUs can totally shine with their horsepower. This is why training is done on NVIDIA, and why Macs with 96 or 192 GB RAM are perfect for "cheap" inference of quite large models (NVIDIA RAM is crazy expensive). And a lot of development is done for these, e.g. ollama, llama.cpp, …
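As a rough back-of-envelope illustration (assuming roughly 135 GB/s memory bandwidth for the Snapdragon X Elite and the 4.33 GiB Q4_0 8B weights from the tables above): every generated token has to stream the whole model from RAM once, so the hard ceiling is about 135 / 4.33 ≈ 31 tokens/s before accounting for the KV cache and compute overhead - which is why measured token generation sits well below that, regardless of how much extra compute a GPU or NPU could add.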

@jonathanarava commented on GitHub (Oct 31, 2024):

Thank you for the detailed explanation, Andreas! Your insights into the limitations of memory bandwidth and how LLMs process tokens have really helped clarify things. It makes sense that loading the entire model onto the GPU could potentially minimize CPU overhead, but as you pointed out, the underlying architecture of these SoCs complicates that.

Thanks again for your help!

@behroozbc commented on GitHub (May 9, 2025):

Is any update available for this issue?

@AndreasKunar commented on GitHub (May 11, 2025):

> Is any update available for this issue?

Here is the current status of Snapdragon X GPU/NPU support to my knowledge:

- **GPU**: llama.cpp has an [OpenCL backend](https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/OPENCL.md) (a build sketch follows this list). But this is **still slower** than running on the very fast Snapdragon X CPUs, so **it currently makes no sense to use it**.
- **NPU**: It's **still not implemented by llama.cpp**. For details/progress see [this issue](https://github.com/ggml-org/llama.cpp/issues/7772). Microsoft's [AI Toolkit for VSCode](https://marketplace.visualstudio.com/items?itemName=ms-windows-ai-studio.windows-ai-studio) lets you play with some NPU models (see there for new developments). But last time I tested it, it was slow vs. the Snapdragon X CPUs' horsepower.
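A minimal sketch of trying that OpenCL backend (based on the linked OPENCL.md; treat that document as authoritative for the exact flags, which may change):

```shell
# build llama.cpp with the OpenCL backend enabled
cmake -B build -DGGML_OPENCL=ON
cmake --build build --config Release
# offload layers to the Adreno GPU with the usual -ngl option
./build/bin/llama-cli -m model-Q4_0.gguf -ngl 99 -p "hello"
```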

I could not find out whether using the GPU/NPU would yield more power efficiency while still having good performance. My problem with the Snapdragon X is that I could not get any power-consumption metrics for its SoC.

I tried to find out whether GPU/NPU/... use can help with power consumption, both for prompt-processing/prefill (compute-bound) and token-generation/decode (mainly memory-bandwidth bound) - see [my medium.com article](https://medium.com/@andreask_75652/power-consumption-of-our-ai-use-f2b1f9bce97b), where I analyzed my Apple M-series SoC and the very power-efficient NVIDIA Jetson Orin SoC (but did not have any way to measure SoC power draw on my Snapdragon X machine).

@chraac commented on GitHub (May 11, 2025):

> I tried to find out whether GPU/NPU/... use can help with power consumption, both for prompt-processing/prefill (compute-bound) and token-generation/decode (mainly memory-bandwidth bound) - see my medium.com article, where I analyzed my Apple M-series SoC and the very power-efficient NVIDIA Jetson Orin SoC (but did not have any way to measure SoC power draw on my Snapdragon X machine).

It looks like [hwinfo64](https://www.hwinfo.com/forum/threads/arm64-windows-compatible-app.9506/) can now run on Windows ARM laptops; I don't know whether there are some power metrics available.

![Image](https://github.com/user-attachments/assets/cef6a4e2-47f9-46da-aa37-1d35df0cc428)

@AndreasKunar commented on GitHub (May 11, 2025):

> > I tried to find out whether GPU/NPU/... use can help with power consumption, both for prompt-processing/prefill (compute-bound) and token-generation/decode (mainly memory-bandwidth bound) - see my medium.com article, where I analyzed my Apple M-series SoC and the very power-efficient NVIDIA Jetson Orin SoC (but did not have any way to measure SoC power draw on my Snapdragon X machine).
>
> It looks like hwinfo64 can now run on Windows ARM laptops; I don't know whether there are some power metrics available.

Thanks. I'm using HWMonitor on the Surface, and it provides a lot of sensor information on arm64, but no power-consumption details yet. HWMonitor is adding new sensors from time to time. I have not tried hwinfo64 yet. Windows' `powercfg /SYSTEMPOWERREPORT` displays some "Energy Meter" data for the CPU and GPU, but I could not figure out how to use this, or whether the NPU provides an energy-meter input.

Overall it's too complicated / too much effort for me on the Snapdragon X. On Apple/NVIDIA hardware it's easier: on Macs there is e.g. github: tlkh/asitop or exelban/stats; on the NVIDIA Jetson there is rbonghi/jetson_stats. Both provide SoC power-consumption values which I could use together with running llama.cpp performance measurements.

@samirgaire10 commented on GitHub (Jun 7, 2025):

Please, please, please - we urgently need better and faster support for Snapdragon's NPU and GPU. Please make this a top priority!

@sXe79 commented on GitHub (Dec 18, 2025):

Hello, I finally decided to try a local LLM, only to find out that the NPU of my Surface Laptop 7 X Elite is 0% used. Meh :/

@rpascalsdl commented on GitHub (Dec 19, 2025):

@sXe79 to be fair, the state of NPUs is laughable at best. From my testing on a Snapdragon, you're better off running LLMs on CPU.

If you really want to give NPUs a try you have the option of Nexa SDK, or using VS Code with AI Toolkit extension. Just make sure you use models that are marked with Qualcomm NPU support.

Last note: always run LLMs when plugged in. For some reason, my laptop at least throttles down so much on battery that it's completely unusable.

@xgdgsc commented on GitHub (Dec 19, 2025):

https://github.com/microsoft/Foundry-Local already works for the X Elite NPU. The service isn't very stable - I get crashes - but you can try it out.

@lyleschemmerling commented on GitHub (Dec 19, 2025):

https://anythingllm.com/ can also run on the NPU, which is interesting because their main backend is ollama. But it is not impressive; the CPU usually wins in an apples-to-apples comparison.

I have gotten GPU acceleration to work on the Adreno. It was a huge pain for negligible performance gain, and the drivers are not stable. People are still working on it, but I doubt the juice will be worth the squeeze.

The Snapdragon is pretty well optimized as far as CPUs go. Stick with that.

@BootsSiR commented on GitHub (Dec 19, 2025):

> @sXe79 to be fair, the state of NPUs is laughable at best. From my testing on a Snapdragon, you're better off running LLMs on CPU.
>
> If you really want to give NPUs a try you have the option of Nexa SDK, or using VS Code with the AI Toolkit extension. Just make sure you use models that are marked with Qualcomm NPU support.
>
> Last note: always run LLMs when plugged in. For some reason, my laptop at least throttles down so much on battery that it's completely unusable.

I tested LLMs with both CPU and NPU on my Snapdragon device, and the CPU crushed the NPU in terms of performance.

@rpascalsdl commented on GitHub (Dec 19, 2025):

Since the subject of the GPU was raised: GPT-OSS 20B runs at a very acceptable up to 20 tokens per second using Nexa SDK - compared to about 3-4 on CPU. But I do suspect they're doing some dark magic to make it happen. Maybe that's why it's not in their public list of models, but if you go through their X posts, you can figure out how to run it.

@lyleschemmerling commented on GitHub (Dec 19, 2025):

Interesting. I might give it another shot this weekend. If I succeed I'll try to update this thread.

@rjtokenring commented on GitHub (Mar 6, 2026):

WIP: https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/snapdragon/README.md

@behroozbc commented on GitHub (Apr 9, 2026):

Is any update available for this issue?

@arudaev commented on GitHub (Apr 9, 2026):

> Is any update available for this issue?

Good question. I think the simple answer is that the NPU isn't made to run LLMs or GGUF models; it's made to run built-in AI features at the same speed as before, but with less energy consumption and without impacting the CPU/GPU.


Reference: github-starred/ollama#3357