Originally created by @flyfox666 on GitHub (Jun 28, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5360
Originally assigned to: @dhiltgen on GitHub.
Hi all.
I just got a Microsoft Surface Laptop 7, an AI PC with a Snapdragon X Elite (NPU and Adreno GPU). It is an ARM-based system.
But I found that the NPU is not being used when running Ollama.
Will Ollama support the NPU and GPU?
@tholum commented on GitHub (Jun 28, 2024):
More than support for the GPU, I think the Hexagon NPU would be the better thing to support.
@flyfox666 commented on GitHub (Jun 29, 2024):
Yep, the NPU is better.
@leejw51 commented on GitHub (Jun 30, 2024):
On a Samsung Galaxy Book4 with a Snapdragon X Elite, Ollama is too slow.
@Srafington commented on GitHub (Jun 30, 2024):
Those wanting a bit more oomph before this issue is addressed should run Ollama via WSL, as there are native ARM binaries for Linux. They still won't support the NPU or GPU, but it is still much faster than running the Windows x86-64 binaries through emulation. SLMs like Phi are very speedy when run this way.
@dhiltgen commented on GitHub (Jul 3, 2024):
We don't yet have an official Windows on ARM binary, but you should be able to build from source until we do.
@danilofalcao commented on GitHub (Jul 7, 2024):
I would be available to test any developments on that matter if necessary.
@dhiltgen commented on GitHub (Jul 22, 2024):
Once #5712 merges we'll have official support for running in CPU mode on the Snapdragon systems, but additional PR(s) will need to merge upstream in llama.cpp before NPU/GPU support can be enabled. On my test system, I'm seeing ~18-20TPS on llama3 on the CPU, so it's usable. My understanding is the NPU may actually be slightly slower, although much more power efficient.
@AndreasKunar commented on GitHub (Jul 23, 2024):
Please note that recent llama.cpp innovations with Q4_0_4_8 quantization on Snapdragon X CPUs give nearly the same performance as, or more than, Q4_0 on base Apple Silicon with GPUs; see "accelerating Q4_0 CPU performance 2-2.5x".
I also tried to get llama.cpp GPU-acceleration to work on Snapdragon X via Vulkan, but it's not working (yet) - see llama.cpp issue #8455.
@AndreasKunar commented on GitHub (Jul 23, 2024):
E.g., here is the performance of a Snapdragon X Plus (CPU-only, but Q4_0_4_8-optimized) vs. a 10-core M2 (CPU and GPU) for the new Llama3-8B Groq-Tool-Use optimized local LLM. Yes, the Plus is still slower than the M2, but not by much, and the Elite is probably faster.
Snapdragon X Plus, Surface 11 Pro 16GB, Windows 11 24H2, MSVC+clang, llama.cpp build: 081fe431 (3441):
M2 10GPU, MacBook Air 24GB, MacOS 14.5, llama.cpp build: 081fe431 (3441):
P.S: llama.cpp Q4_0_4_8 conversion is done via
./llama-quantize --allow-requantize <q4_0 model-name> <q4_0_4_8 name> Q4_0_4_8
P.P.S.: token-generation (tg) is largely memory-bandwidth bound, while prompt-processing (pp) is compute-horsepower dependent.
@flyfox666 commented on GitHub (Jul 24, 2024):
> Once #5712 merges we'll have official support for running in CPU mode on the Snapdragon systems, but additional PR(s) will need to merge upstream in llama.cpp before NPU/GPU support can be enabled. On my test system, I'm seeing ~18-20TPS on llama3 on the CPU, so it's usable. My understanding is the NPU may actually be slightly slower, although much more power efficient.
Hi, thanks for the reply. Looking forward to it.
@flyfox666 commented on GitHub (Jul 24, 2024):
thanks a lot
@Hassansaleh22 commented on GitHub (Jul 27, 2024):
Thanks all,
Do you have an estimated timeline for when the necessary pull requests (#5712 and others for NPU/GPU support) will be merged? Also, will we need to uninstall the current version before updating to get native ARM working without emulation?
@SebastianGode commented on GitHub (Aug 1, 2024):
@AndreasKunar Importing the Q4_0_4_8 model built under WSL into the native ARM Ollama doesn't seem to work.
Ollama doesn't support Q4_0_4_8 yet, correct?
@AndreasKunar commented on GitHub (Aug 1, 2024):
Q4_0_4_8 requires an arm64 compile of llama.cpp (Linux and Windows). And for Windows it requires a build with clang, since MSVC does not support the required inline asm for arm64; see the llama.cpp build instructions. I don't know how ollama builds, or whether the llama.cpp component's build process builds correctly for Windows on ARM - I have not tested PR #5712 yet.
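A minimal sketch of pointing a llama.cpp CMake build at clang (the llama.cpp build instructions are the authoritative, arm64-specific reference; this only illustrates the compiler selection):
# from a llama.cpp checkout; clang/clang++ must be able to target arm64
cmake -B build -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
cmake --build build --config Release -j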
Building for Snapdragon X in WSL2 with e.g. Ubuntu is commonly much easier, and it's not slower than on native Windows. Just don't forget to allocate CPUs and memory to WSL2 in %USERPROFILE%\.wslconfig (a sketch follows below).
I will try to build ollama in WSL2 on my Surface and try to import and use a Q4_0_4_8 model.
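A minimal .wslconfig sketch along those lines, assuming a 16 GB machine with 12 cores (the values are illustrative; adapt them to your hardware):
[wsl2]
# raise the default 50% RAM cap so the model and the build fit
memory=12GB
# make all cores available to the build and to inference
processors=12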
@AndreasKunar commented on GitHub (Aug 1, 2024):
@SebastianGode - I tried to build ollama on WSL2/Ubuntu24.04 on my Surface 11 Pro and test it with Q4_0_4_8.
Ollama+llama.cpp builds, imports my local llama-2 Q4_0, and runs it.
But when I try to import my local llama-2 Q4_0_4_8 model (which runs with llama.cpp), it gives an "Error: invalid file magic" from its ggml.go module (at line #311), which does not seem to understand the new Q4_0_4_4 and Q4_0_4_8 formats.
Should we raise an issue?
@SebastianGode commented on GitHub (Aug 1, 2024):
@AndreasKunar Yes, that is the exact same issue for me. Good that you could verify that and that I wasn't too dumb to use Ollama.
Please go ahead and open an issue. I assume this shouldn't be that hard to fix, likely just some dependency which would need to be updated (but that's just my assumption).
@Berowne commented on GitHub (Aug 28, 2024):
I'm keen to stand on the shoulders of giants. I've subscribed to this thread! Keep up the good work.
@arudaev commented on GitHub (Sep 9, 2024):
I'm new to using llama.cpp and related tools. After testing my device, I'm satisfied with its performance, but Ollama is running very slowly. My goal is to set up a Docker container that leverages WSL2 to run Llama 3 (7B) efficiently and maximize performance. However, the available resources are overwhelming and unclear. I need a streamlined solution to run an Ollama container with optimal speed and accuracy.
@AndreasKunar commented on GitHub (Sep 9, 2024):
The Snapdragon X, however, supports accelerated execution via the CPU, but this is in its very early stages in the core llama.cpp code that ollama uses. This CPU acceleration is mainly for prompt-processing (2-3x faster); LLM token-generation is more dependent on memory bandwidth and is not improved much. And it currently requires a special model format (quantized as Q4_0_4_8 instead of Q4_0).
ollama on Windows (not WSL2) is currently in preview. You need to compile it manually if you want it to run natively on Windows on ARM - I would not recommend this for beginners. The installation you are using might run emulated as x64 code.
WSL2 needs to be configured accordingly (file .wslconfig in your Windows user directory) in order to use the right amount of RAM (setting: memory, the default is only 50%) and all CPUs (setting: processors, I suggest 12). You need to make sure that you use an aarch64/arm64 ollama Linux image. But ollama builds and runs really well on WSL2 Linux. Running ollama in a correctly configured WSL2 is as fast as (maybe even faster than) running natively. There are performance penalties if you don't store your files natively in WSL2/Linux. WSL2 currently does NOT support GPU/NPU acceleration for the Snapdragon X, but it supports CPU acceleration of the llama.cpp code.
docker - I have no experience with running ollama on WSL2-based Docker on Windows for ARM.
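For what it's worth, a minimal sketch of the CPU-only invocation from the Ollama Docker docs (the official image is multi-arch, so an arm64 WSL2 kernel should pull the arm64 variant; untested on Snapdragon here, per the note above):
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# then pull and run a model inside the container
docker exec -it ollama ollama run llama3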
A few personal notes on the Surface Pro 11 and ollama/llama.cpp:
ollama is a great shell for reducing the complexity of the base llama.cpp code and I really like it!!! But the innovations in GPU/NPU acceleration happen first in llama.cpp. I use the llama.cpp llama-server instead of ollama when trying out new things. But you need to manually download your model and start llama-server with the right parameters (see the sketch after this list). As a benefit, llama-server not only serves OpenAI-compatible APIs but is also a playground-like webserver. When using Llama 3.1 8B with its long context, don't forget to limit the context size, otherwise your RAM use "explodes" (because of the required KV cache for the default 128k context).
The thermals of the Surface Pro 11 tablet force the Snapdragon X to throttle quite soon if you max out all the CPUs while running your LLMs. Look at your CPU utilization (there currently is no CPU-temperature monitor for the Surfaces).
If you want GPU acceleration for your Surface, you might try WebGL-based AI (e.g. in Chrome).
NPU-accelerated AI on the Surface currently all seems to be Qualcomm QNN based. Microsoft's Semantic Kernel supports QNN (for C# code; they are working on Python support).
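A minimal llama-server sketch along the lines above, assuming a locally downloaded GGUF at ./llama-3.1-8b-q4_0.gguf (hypothetical path) and a reduced context to keep the KV cache small:
# -m: model file, -c: context size (limits KV-cache RAM), -t: CPU threads, --port: HTTP port
./llama-server -m ./llama-3.1-8b-q4_0.gguf -c 8192 -t 12 --port 8080
Then point a browser (or any OpenAI-compatible client) at http://localhost:8080.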
Hope this helps/clarifies a little (and best regards from Vienna).
@arudaev commented on GitHub (Sep 9, 2024):
Thank you so much for your detailed response and insights! You've clarified a lot of points that were overwhelming and confusing. Even with the current limitations of the SP11 device, I hope to still develop a container that works on WSL2 on Windows for ARM.
I'll take a closer look at using llama.cpp with the llama-server as you suggested, especially for new experiments. I'll also keep in mind the context-size limits to avoid excessive RAM usage with Llama 3.1. It's a great reminder to check CPU utilization to prevent thermal throttling on the Surface Pro 11.
Your advice has given me a lot of direction, and I really appreciate your time and insights!
@twlswan commented on GitHub (Sep 11, 2024):
I'm completely out of the loop, but hasn't that PR (#6869) been closed by llama.cpp's maintainer after the PR author complained without tact (I do wonder how much of it was even his intention, since his English was clearly A2 level at best...)?
That said, thanks a ton for sharing; it looks like the X Elite (especially the SKUs with 12 cores) is actually pretty good.
@AndreasKunar commented on GitHub (Sep 12, 2024):
Someone in the thread forked his own version and still seems to be working on it - not the PR originator. I'm currently swamped with other work, but will try to get into it more deeply in October.
@jonathanarava commented on GitHub (Oct 28, 2024):
Can we please bump this ticket up somehow? Or at least, which link can I follow to track development on this?
I am currently using Ollama 0.3.14 on the Snapdragon X Elite. It is really good at running the Llama 3.1 8B model (even if it is offloaded to the CPU and not using the GPU). But obviously I would like it to use the GPU (taking the comment that running on the NPU results in lower tokens/s at face value).
Thank you
@AndreasKunar commented on GitHub (Oct 28, 2024):
There seems to be no development being done that could be used for ollama…
I don't think that running it on the GPU will be faster than e.g. the Q4_0_4_4 quantization runs on the CPU. I also have an M2 with a 10-core GPU, and running Q4_0 on its GPU has approximately the same tokens/s performance as my Snapdragon X Elite on the CPU with Q4_0_4_4. Its Adreno GPU has less horsepower than the M2's. So there is little benefit to be had for a lot of work, and for very few users - running the GPU on the Snapdragon X via Vulkan on Windows / llama.cpp does not work because of a driver issue. As for supporting the NPU, even ONNX/QNN cannot use the NPU for Llama models - apparently it's too complicated, or maybe I was just too stupid to get it to work.
So, net, my recommendation is: don't expect the Snapdragon X's GPU/NPU to get full LLM support in llama.cpp inference anytime soon. The NPU will likely only be usable for very small, dedicated SLMs inside special apps developed with QNN. All the rest will run (quite fast) on the CPU. Also remember that LLM inference is largely bound by memory bandwidth, and not so much by compute horsepower, so there is not much to be gained from developing the special GPU code.
@jonathanarava commented on GitHub (Oct 28, 2024):
Thank you for your swift response. Your explanation makes sense.
I agree that memory bandwidth is a critical factor. (rhetorical question) Wouldn't it be more efficient to load the entire model onto the GPU? This approach could potentially minimize the CPU cycles required for data transfer between RAM and the CPU, leading to improved inference times. I understand that the CPU may not be the bottleneck in this scenario, but overall, it would be interesting to see the full capability of using CPU, RAM and the NPU on low spec devices.
Thanks again!
@AndreasKunar commented on GitHub (Oct 28, 2024):
LLMs generate each new token by computing the entire graph of their artificial neural network again and again. So they have to pump the entire billions of parameters, plus the KV caches (something like the AI's short-term memory; it gets huge, up to several GB, with large contexts like Llama 3.1's 128k), out of unified RAM (these SoCs don't have dedicated RAM for the CPU/GPU/NPU) into the quite tiny on-chip caches to process the computations. This has to happen anew for each token. They can do a lot of computations at the same time (e.g. my M2 Mac's GPU has over 1000 units / ALUs for simultaneous computation), so the GPUs idle a lot, waiting for their data from memory. This is why modern SoCs have a RAM bandwidth of 100-130 GByte/s. And yet this is still the bottleneck; even the Snapdragon X CPUs have enough simultaneous matrix-processing units to handle it. The M2 Pro doubles the bandwidth to 200, the M2 Max has 400, the M2 Ultra has 800, and the NVIDIA 4090 over 1000 - that's why they are faster.
Only when the LLM initially processes the prompt, and during learning/fine-tuning, can it batch the processing of multiple tokens at once, and then the GPUs can totally shine with their horsepower. This is why learning is done on NVIDIA, and the Macs with 96 or 192 GB RAM are perfect for "cheap" inference of quite large models (NVIDIA RAM is crazy expensive). And a lot of development is done for these, e.g. ollama, llama.cpp, …
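A rough back-of-envelope sketch of that bandwidth limit (illustrative numbers, not measurements): an 8B model at Q4_0 is roughly 4.5 GB of weights, and each generated token has to stream essentially all of them from RAM, so:
# decode ceiling ≈ memory bandwidth / bytes read per token
echo "scale=1; 120 / 4.5" | bc   # ≈ 26.6 tokens/s at ~120 GB/s
That is the same ballpark as the ~18-20 TPS reported earlier in this thread, before KV-cache traffic and thermal throttling.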
@jonathanarava commented on GitHub (Oct 31, 2024):
Thank you for the detailed explanation, Andreas! Your insights into the limitations of memory bandwidth and how LLMs process tokens have really helped clarify things. It makes sense that loading the entire model onto the GPU could potentially minimize CPU overhead, but as you pointed out, the underlying architecture of these SoCs complicates that.
Thanks again for your help!
@behroozbc commented on GitHub (May 9, 2025):
Is there any update on this issue?
@AndreasKunar commented on GitHub (May 11, 2025):
Here is the current status of Snapdragon X GPU/NPU support to my knowledge:
Microsoft's AI Toolkit for VS Code lets you play with some NPU models (see there for new developments). But last time I tested this, it was slow compared to the Snapdragon X CPU's horsepower.
I could not find out whether using the GPU/NPU would yield more power efficiency while still having good performance. My problem with the Snapdragon X is that I could not get any power-consumption metrics from its SoC.
I tried to find out whether GPU/NPU/... use can help with power consumption, both for prompt-processing/prefill (compute-bound) and token-generation/decode (mainly memory-bandwidth bound) - see my medium.com article, where I analyzed Apple's M-series SoCs and the very power-efficient NVIDIA Jetson Orin SoC (but did not have any chance to measure SoC power draw on my Snapdragon X machine).
@chraac commented on GitHub (May 11, 2025):
Looks like HWiNFO64 can now run on Windows ARM laptops; I don't know whether any power metrics are available.
@AndreasKunar commented on GitHub (May 11, 2025):
Thanks. I'm using HWMonitor on the Surface, and this provides a lot of sensor information for arm64, but no power-consumption details yet. HWMonitor is adding new sensors from time to time. I have not tried HWiNFO64 yet. Windows' powercfg /SYSTEMPOWERREPORT displays some "Energy Meter" data for the CPU and GPU, but I could not figure out how to use this, or whether the NPU provides an energy-meter input.
Overall it's too complicated / too much effort for me on the Snapdragon X. On Apple/NVIDIA hardware it's easier: on Macs there's e.g. github: tlkh/asitop or exelban/stats, and on the NVIDIA Jetson there is rbonghi/jetson_stats. Both provide SoC power-consumption values which I could use together with llama.cpp performance measurements.
@samirgaire10 commented on GitHub (Jun 7, 2025):
Please, please, please: we urgently need better and faster support for Snapdragon's NPU and GPU. Please make this a top priority!
@sXe79 commented on GitHub (Dec 18, 2025):
Hello, I finally decided to try a local LLM, just to find out the NPU of my Surface Laptop 7 XElite is 0% used. Meh :/
@rpascalsdl commented on GitHub (Dec 19, 2025):
@sXe79 to be fair, the state of NPUs is laughable at best. From my testing on a Snapdragon, you're better off running LLMs on CPU.
If you really want to give NPUs a try, you have the option of the Nexa SDK, or using VS Code with the AI Toolkit extension. Just make sure you use models that are marked with Qualcomm NPU support.
Last note: always run LLMs when plugged in. For some reason my laptop, at least, throttles down so much otherwise that it's completely unusable.
@xgdgsc commented on GitHub (Dec 19, 2025):
https://github.com/microsoft/Foundry-Local already works for the X Elite NPU, although the service isn't that stable - I get crashes. But you can try it out.
@lyleschemmerling commented on GitHub (Dec 19, 2025):
https://anythingllm.com/ can also run on the NPU, which is interesting because their main backend is ollama. But it is not impressive; the CPU usually wins in an apples-to-apples comparison.
I have gotten GPU acceleration to work on the Adreno. It was a huge pain for negligible performance gain, and the drivers are not stable. People are still working on it, but I doubt the juice will be worth the squeeze.
The Snapdragon is pretty well optimized as far as CPUs go. Stick with that.
@BootsSiR commented on GitHub (Dec 19, 2025):
I tested LLMs with both the CPU and NPU on my Snapdragon device, and the CPU crushed the NPU in terms of performance.
@rpascalsdl commented on GitHub (Dec 19, 2025):
Since the subject of the GPU was raised: GPT-OSS 20B runs at a very acceptable rate of up to 20 tokens per second using the Nexa SDK, compared to about 3-4 on the CPU. But I do suspect they're doing some dark magic to make it happen. Maybe that's why it's not in their public list of models, but if you go through their X posts, you can figure out how to run it.
@lyleschemmerling commented on GitHub (Dec 19, 2025):
Interesting. I might give it another shot this weekend. If I succeed I'll try to update this thread.
@rjtokenring commented on GitHub (Mar 6, 2026):
WIP: https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/snapdragon/README.md
@behroozbc commented on GitHub (Apr 9, 2026):
Is there any update on this issue?
@arudaev commented on GitHub (Apr 9, 2026):
Good question. I think the simple answer is that the NPU isn't made to run LLMs or GGUF models; it's made to run built-in AI features at the same speed as before, but with less energy consumption and without impacting the CPU/GPU.