Originally created by @J0hnny007 on GitHub (Nov 6, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1016
Originally assigned to: @dhiltgen on GitHub.
I'm currently trying out the ollama app on my iMac (i7/Vega64) and I can't seem to get it to use my GPU.
I have tried running it with num_gpu 1 but that generated the warnings below.
2023/11/06 16:06:33 llama.go:384: starting llama runner
2023/11/06 16:06:33 llama.go:386: error starting the external llama runner: fork/exec /var/folders/2z/r_0t221x2blbq02n5dp2m5fr0000gn/T/ollama1975281143/llama.cpp/gguf/build/metal/bin/ollama-runner: bad CPU type in executable
2023/11/06 16:06:33 llama.go:384: starting llama runner
2023/11/06 16:06:33 llama.go:442: waiting for llama runner to start responding
{"timestamp":1699283193,"level":"WARNING","function":"server_params_parse","line":873,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}
{"timestamp":1699283193,"level":"INFO","function":"main","line":1324,"message":"build info","build":219,"commit":"9e70cc0"}
{"timestamp":1699283193,"level":"INFO","function":"main","line":1330,"message":"system info","n_threads":6,"n_threads_batch":-1,"total_threads":12,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}

@BruceMacD commented on GitHub (Nov 7, 2023):
Hi @J0hnny007, thanks for opening the issue. Ollama only supports the Metal GPU API on Macs right now. AMD GPUs won't work.
@J0hnny007 commented on GitHub (Nov 7, 2023):
Good to know, though I thought that MPS could use AMD GPUs. Oh well, thanks for the info.
@cmarhoover commented on GitHub (Dec 13, 2023):
Apple's "Metal Overview" page has the following hardware support list in the page footer:
Despite being listed as supporting Metal 3, I can confirm that Ollama does not currently use the Radeon RX 6900 in my Mac Pro system.
@Basten7 commented on GitHub (Dec 16, 2023):
Me too. I confirm that Ollama does not use the Radeon RX 6800X on my Mac Pro even with `PARAMETER num_gpu 1` set in the Modelfile.
@cracksauce commented on GitHub (Dec 20, 2023):
Are there any plans for Ollama to support this type of hardware setup (AMD GPUs on Intel Mac)?
@ucodia commented on GitHub (Jan 1, 2024):
Intel Macs with AMD graphics cards do have support for Metal 3, as the screenshot below attests.
Though, as previously reported, Ollama does not seem to be able to leverage the AMD GPU despite the API support on macOS.
@J0hnny007 Could we please reopen this issue? It was closed on the assumption that AMD GPUs were not compatible with Metal.
@pjv commented on GitHub (Jan 2, 2024):
Some possibly relevant data: on my Intel iMac Pro with an AMD Radeon Pro Vega (8 GB VRAM), if I build the current head of llama.cpp with `make CUBLAS=1`, the resulting `main` binary will run models with the GPU.

@cracksauce commented on GitHub (Jan 2, 2024):
Could you describe how to do that for those of us who are less technical? Would appreciate it, thanks!
@pjv commented on GitHub (Jan 3, 2024):
@cracksauce my report wasn’t a how-to fix for ollama. It was a pointer to the ollama developers that may allow them to tweak how they build one of the ollama dependencies in a way that could possibly allow ollama to make use of AMD GPUs on intel macs.
If you are interested in building and running llama.cpp directly, you should check out that project’s repo.
@leobenkel commented on GitHub (Jan 3, 2024):
Hello,
I have a macOS machine with these specs:
Has anyone been able to find a solution for running the Ollama Docker image with the GPU? I have not found a tutorial that works. I tried following the NVIDIA one, which obviously did not work.
@cracksauce commented on GitHub (Jan 7, 2024):
@leobenkel Seems like there might be a potential adjustment the devs can make to one of the Ollama dependency builds to take advantage of AMD GPUs' Metal 3 support on Intel Macs. TBD I suppose!
cc @J0hnny007 @BruceMacD
Some other possible fixes and random tweaks after perusing llama.cpp repo:
https://github.com/ggerganov/llama.cpp/issues/2965#issuecomment-1763223051
https://github.com/ggerganov/llama.cpp/issues/3000
https://github.com/ggerganov/llama.cpp/issues/3129#issuecomment-1848436692
https://github.com/ggerganov/llama.cpp/pull/1435#issuecomment-1546928978
https://github.com/ggerganov/llama.cpp/issues/1429#issuecomment-1805455807
@leobenkel commented on GitHub (Jan 12, 2024):
Thank you @cracksauce , that would be great ! :)
@dhiltgen commented on GitHub (Jan 15, 2024):
PR #2007, once merged, likely provides a foundation upon which we could support this.
Much like the gen_linux.sh script, we could augment the gen_darwin.sh script in the x86 case to look for the underlying GPU libraries on the build system and, if they are detected, build a variant of llama.cpp with the appropriate flags. The detection logic would likely need some adjustments as well for Intel Macs.
@birchcode commented on GitHub (Feb 16, 2024):
Would love this. Running a 6900xt here.
Any way we can help?
@dhiltgen commented on GitHub (Feb 16, 2024):
The biggest unknown in my mind is the viability of the underlying GPU libraries (CUDA/ROCm) on Intel macOS. When Apple released the M-series chips with integrated GPUs, they alienated both AMD and NVIDIA, so neither company is going to support their libraries going forward on Intel Macs. So really the question is: what was the last supported version, and is that version viable for building llama.cpp? So I think the answer to your question is: try to see if you can get upstream llama.cpp to build on your Intel Mac with the last supported version of ROCm and leverage your Radeon GPU. If that works, then my guidance above on the build scripts would apply to wiring that into our build process.
I'm not sure we'd integrate this into our official builds given the sunsetting nature of this compatibility matrix, but I think we'd be open to improvements to the build scripts so that people can build from source on Intel Macs and get GPU acceleration.
@birchcode commented on GitHub (Feb 18, 2024):
The alienation explains some things. Yes, I think ROCm has never been supported on Apple. I would have to boot into Linux (my next option) to use that. But we have Metal; not sure what the mileage will be.
I was able to build llama.cpp with `make CUBLAS=1`, running 11.6 (Metal Family: Supported, Metal GPUFamily macOS 2).

@dhiltgen commented on GitHub (Feb 19, 2024):
@birchcode that sounds like a good step. What sort of performance are you able to achieve, and does it look promising?
Using the Metal API on Intel Macs for these other GPUs may complicate our memory detection and layer calculations. Somehow we'd need to refine https://github.com/ollama/ollama/blob/main/gpu/gpu_darwin.go to retrieve the GPU memory from some Metal API, and then use an algorithm similar to the CUDA/ROCm version.
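For illustration, Metal does expose a per-device memory budget that such detection could read. A minimal sketch in Swift (ollama's actual discovery code lives in gpu/gpu_darwin.go and gpu/gpu_info_darwin.m, in Go and Objective-C; this is not that code):

```swift
import Metal

// Enumerate every Metal device and report its memory budget.
// recommendedMaxWorkingSetSize is the per-device working-set budget in
// bytes -- the closest Metal analogue to the "total VRAM" figure the
// CUDA/ROCm discovery paths report.
for device in MTLCopyAllDevices() {
    let budgetMiB = device.recommendedMaxWorkingSetSize / (1 << 20)
    print("\(device.name): \(budgetMiB) MiB budget, low power: \(device.isLowPower)")
}
```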
@birchcode commented on GitHub (Feb 20, 2024):
@dhiltgen I went to run it but had no models. I've previously downloaded some models using the GUI; is it possible to reuse them somehow rather than download new ones? I'm not on the fastest connection right now.
I downloaded deepseek-ai/deepseek-coder-6.7b-instruct and followed the guide to convert it, but had some issues with that.
@dhiltgen commented on GitHub (Feb 20, 2024):
A little trick: if you run the `ollama serve` command and load up a model, you can see the file path of the model in the server log output, and then you can pass that file to the llama.cpp server executable. Example:
Then in your llama.cpp repo, after building the server
@birchcode commented on GitHub (Feb 21, 2024):
That helped. Moving a little closer...
@FellowTraveler commented on GitHub (Mar 12, 2024):
Hey, I use an Intel Mac with an AMD Radeon Pro 5500M, which supports the Metal 3 API, but I'm having trouble getting Ollama to work. LMK when you fix this, because there are a million more like me.
*EDIT: Apparently it's no good to use the Metal 3 API, because it's optimized for Apple silicon, not AMD GPUs. So you have to use Vulkan or ROCm or god knows what, idk.
@cracksauce commented on GitHub (Apr 18, 2024):
Any updates on this?
@xakrume commented on GitHub (Apr 20, 2024):
llama3, compiled from source on my Mac Pro with an AMD Radeon RX 6950 XT 16GB, successfully utilized the GPU.
@xakrume commented on GitHub (Apr 20, 2024):
llama3 build options
@renanwilliam commented on GitHub (Apr 24, 2024):
I have an Intel Mac with an i9 and a Radeon Pro Vega 20 4 GB GPU / 32GB RAM. Running Ollama is incredibly slow and almost unusable, unfortunately
@cracksauce commented on GitHub (Apr 26, 2024):
@xakrume Can you explain in more detail how you did this? You took the llama.cpp build using zig? What commands did you run? Did you make any changes to how you use ollama?
@dhiltgen commented on GitHub (Apr 26, 2024):
@xakrume if you're up for it, could you post a PR to update the x86 darwin build to add a metal variant?
The build portion would be added around here
and we'd need some adjustments to the GPU discovery logic to be able to identify when to use this variant. At present it's simplistic: it just toggles CPU variants on x86 and always uses Metal on ARM, but I think we'd need to actually discover on x86 whether there is a Metal GPU present. https://github.com/ollama/ollama/blob/main/gpu/gpu_darwin.go and https://github.com/ollama/ollama/blob/main/gpu/gpu_info_darwin.m
@xakrume commented on GitHub (Apr 26, 2024):
@cracksauce , sorry, my mistake. No need to run zig.
I'm using MacPorts packages
The binary files are located at bin/ in your current build directory.
CPU features from `zig targets`.

CPU Intel(R) Core(TM) i9-9900K info:
@xakrume commented on GitHub (Apr 27, 2024):
Currently, `go generate ./...` generates a binary with Metal support. Why does `go build .` create an `ollama` binary that does not use the GPU?

@xakrume commented on GitHub (Apr 27, 2024):
Running with the serve argument execs a binary without Metal support:
cmd="/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/CPU_AVX2/ollama_llama_server"
The Metal binary exists but is not used.
@xakrume commented on GitHub (Apr 27, 2024):
versus M1 Pro CPU initialization:
@xakrume commented on GitHub (Apr 27, 2024):
I corrected `-DCMAKE_OSX_DEPLOYMENT_TARGET` to 13.3 because there was a warning during the build:
I've decided it doesn't make sense to disable Metal support for macOS if the processor supports AVX2.
I removed the build for processors without AVX instructions and also left the build with static libraries.
Adding a check for Metal support to select the appropriate binary would be beneficial. Currently, the runner is launched only for the CPU, without considering Metal GPU support.
These changes allowed me to use my GPU for computing.
And now my build utilizes the AMD Radeon GPU.
@xakrume commented on GitHub (Apr 27, 2024):
@herobs commented on GitHub (Apr 27, 2024):
@xakrume I've compiled successfully with metal enabled.
But it seems the GPU is slower than the CPU (about 2x). My setup is Intel i5-12400 + AMD 6600xt.
@night0wl0 commented on GitHub (Apr 28, 2024):
@herobs, @xakrume,
When you have a moment, would you be able to post your complete build steps starting with the clone of the fresh Ollama repo?
@herobs commented on GitHub (Apr 28, 2024):
@night0wl0 You should clone the backend llama.cpp instead. Then compile it with whatever method you want (like just `make`). Now you have a GPU-enabled backend binary, and you can invoke it with a model path.

@xakrume commented on GitHub (Apr 29, 2024):
@herobs
My setup:
Intel Core i9 9900K
AMD RX 6950 XT 16GB
macOS: 14.4.1 (23E224) Sonoma
my performance with AMD GPU
llama.cpp, compiled from sources.
CPU benchmark from Ubuntu 24.04 on this workstation (LiveUSB)
I found some issues in llama.cpp with low GPU performance https://github.com/ggerganov/llama.cpp/issues/3422
@dev-zero commented on GitHub (Apr 30, 2024):
@dhiltgen Basically it would require enumerating the GPU devices, filtering by `.metal3`, and returning the `recommendedMaxWorkingSetSize` per device? And the scheduler should then pick the device with the largest memory?
I guess the CPU should not be added to the list but will automatically be chosen if the model doesn't fit into memory?
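A hedged Swift sketch of that selection policy, for illustration only (not ollama's actual scheduler): filter to Metal 3-capable devices, then take the one with the largest working-set budget.

```swift
import Metal

// Pick the Metal 3-capable device with the largest memory budget; nil
// means no eligible GPU, so the caller would fall back to the CPU path.
let best = MTLCopyAllDevices()
    .filter { $0.supportsFamily(.metal3) }
    .max { $0.recommendedMaxWorkingSetSize < $1.recommendedMaxWorkingSetSize }
print(best?.name ?? "no Metal 3 GPU; fall back to CPU")
```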
@xakrume commented on GitHub (Apr 30, 2024):
Which quant did you use?
https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix
Performance also depends on the type of compute device (CPU/GPU/Metal/etc.).
For the Metal and AVX2 backends, a good choice for a performance improvement is K-quants.
Here's how I figured it out:
Ollama's current Llama3 model is Q4_0.
@herobs commented on GitHub (Apr 30, 2024):
@xakrume I'm not familiar with these.
@xakrume commented on GitHub (May 1, 2024):
I'm not a Go developer, unfortunately. But currently, for macOS on Intel with an AMD GPU, there is no option to select a server other than the CPU server, and I need a Metal server. Therefore, in place of the CPU binary, I'm using a server built for Metal. In `llm/payload.go`, there are no checks for the presence of a PCIe GPU. Should we disable the CPU servers and leave only the Metal server for macOS?
@xakrume commented on GitHub (May 1, 2024):
There is an option for Metal-only support, without CPU (AVX/AVX2/AVX512) usage.
@l-m-mortal commented on GitHub (May 10, 2024):
@xakrume
Followed your steps; it compiled well.
But I got this error while trying to run llama3:
"Error: llama runner process has terminated: signal: abort trap error: unsupported op 'RMS_NORM'"
I have an R9 M370X and an RX 570 as an eGPU (both support Metal 2).
Is Metal 2 even relevant?
@night0wl0 commented on GitHub (May 17, 2024):
I am also encountering the same behavior. After some quick initial research, it seems that `RMS_NORM` (and in my case also `MUL_MAT`) are unsupported operations in GGML's Metal backend. My GPU is a Radeon Pro 560.

@xakrume commented on GitHub (May 17, 2024):
I deleted all CPU builds, and only the Metal one remains:
@night0wl0 commented on GitHub (May 17, 2024):
@xakrume, thanks. While I was able to successfully build using the script, I still get the same `RMS_NORM` unsupported-op error as @l-m-mortal.

@l-m-mortal commented on GitHub (May 26, 2024):
@xakrume
Thanks for your reply
Built as you showed (GPU only), and got this:
Error: [0] server cpu not listed in available servers map[metal:/var/folders/t4/b5z_lr0d1j55vnjw32vmzzmh0000gn/T/ollama3650114820/runners/metal]
Ruslan? So that means we can proceed in another language?)
@tristan-k commented on GitHub (May 27, 2024):
+1
Running a RX 6600 XT with a i5-10600K
@Ehco1996 commented on GitHub (Jun 1, 2024):
I also hit this error.
After digging into `llm/server.go`, I was finally able to get Ollama running on the Metal server with a hardcoded workaround, but I found that using the GPU is much slower than the CPU. See more in
@guidocioni commented on GitHub (Jun 7, 2024):
Is this ever going to be supported?
I'm running on macOS with an i9-13900F and an RX 6800 XT, but right now, having installed Ollama directly by downloading the pkg, it only uses the CPU cores.
@cmarhoover commented on GitHub (Jun 15, 2024):
Does llama.cpp commit f8ec887 address this problem in some way? Seems that precompiled builds of llama.cpp after April 2 were impacted. Issue #7940
@tristan-k commented on GitHub (Jun 18, 2024):
Indeed, the latest `llama.cpp` (b3173) does use the GPU on my macOS Sonoma installation. Is there any way to swap the latest `llama.cpp` binaries into `ollama`? I want to use Open WebUI, which depends on `ollama`. Or is there a time window when the changes will arrive in `ollama`?

@dbl001 commented on GitHub (Jun 18, 2024):
GPT2 and Llama2 appear to be working.
Llama3 crashes: GGML_ASSERT: ggml-metal.m:1769: false && "not implemented"
I used these build parameters:
Here's an attempt to run llama3 bf16 on an iMac 27" with an AMD Radeon Pro 5700 XT:
The quantized version of llama3 (Llama-3-8B/ggml-model-bf16-Q4_K_M.gguf, run with -n 128 -ngl 1 -i) outputs gibberish.
Llama2 7B:
Quantized GPT2
@dbl001 commented on GitHub (Jun 19, 2024):
'ollama serve' shows that only the CPU is running when computing llama3 and mistral embeddings (see below). Is there a way to build ollama on the Mac (i.e. darwin) to utilize the AMD GPU, which runs with llama.cpp's 'main' (see comment above)?
@cracksauce commented on GitHub (Jun 19, 2024):
Would also appreciate a solution for this
@dhiltgen commented on GitHub (Jun 19, 2024):
For folks in the community working on this, keep in mind there are 2 pieces of the puzzle that will need to be implemented to make it work.
First is figuring out the right flags to pass to cmake for llama.cpp to compile the x86 metal variant and wire that up as a new runner, most likely called "metal" here.
Second is wiring up "GPU discovery" with VRAM lookup. At startup we discover what GPUs are present, and specifically how much VRAM they have available so we can schedule model loads that don't exceed the available memory. Modifications in gpu_darwin.go and gpu_info_darwin.m will be needed. We need to get the current VRAM usage during runtime for the scheduler to be able to support concurrency as well.
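For the runtime-usage half, Metal can report what the current process has allocated per device. A sketch in Swift of the kind of headroom estimate involved (illustrative only; note the caveat that currentAllocatedSize covers this process's allocations, not other processes'):

```swift
import Metal

// Estimate headroom on each device: budget minus what this process has
// already allocated there. Usage by other processes is not visible
// through this API, so the estimate is optimistic.
for device in MTLCopyAllDevices() {
    let budget = Int64(device.recommendedMaxWorkingSetSize)
    let used = Int64(device.currentAllocatedSize)
    print("\(device.name): ~\((budget - used) / (1 << 20)) MiB free for this process")
}
```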
@l-m-mortal commented on GitHub (Jun 25, 2024):
Thanks to @xakrume, my MacBook Pro 15 (2015) with an AMD GPU managed to run ollama serve,

but it prioritizes the embedded GPU (AMD Radeon R9 M370X) instead of the eGPU (AMD Radeon RX 570).
@ahornby commented on GitHub (Jul 13, 2024):
I got it working based on the above info in https://github.com/ahornby/ollama/tree/macos_amd64_metal; however, the Metal backend on my MacBook 15 (2019, 560X) was slower than the CPU, so I'm giving up. Maybe the commit will be useful to someone else with a faster GPU; it has the commands used in the commit message.
@dbl001 commented on GitHub (Jul 13, 2024):
@ahornby I tried cloning your fork and running ollama on my 2022 iMac 27" with an AMD Radeon Pro 5700 XT. It doesn't appear to find the GPU. Do you see anything I missed?
@ahornby commented on GitHub (Jul 13, 2024):
@dbl001 spotted two differences, your log indicates:
- `go build .` whereas in my commit message the command is `CGO_CFLAGS="-I/usr/local/include" CGO_LDFLAGS="-L/usr/local/lib -framework Accelerate" go build .`
- `ollama serve` whereas in my commit message the command is `GIN_MODE=debug OLLAMA_LLM_LIBRARY=metal ./ollama serve`

@dbl001 commented on GitHub (Jul 13, 2024):
@ahornby My GPU has 16GB... Any suggestions?
Here's the full output.
@ahornby commented on GitHub (Jul 13, 2024):
@dbl001 first try with codellama; that's the only model I tried. If that works, you got as far as I did.
@dbl001 commented on GitHub (Jul 13, 2024):
@ahornby same thing with codellama
@ahornby commented on GitHub (Jul 13, 2024):
In that case I don't know; I'm guessing it's due to the different GPU. Hopefully you can work out the problem on your card now that you have a way to build and test.
@dbl001 commented on GitHub (Jul 13, 2024):
@ahornby Thank You!
@Grergo commented on GitHub (Jul 14, 2024):
Why don't we give Vulkan a try? I've successfully run the Vulkan backend of llama.cpp on an Intel Mac, utilizing the AMD GPU. The performance is slightly better than using just the CPU.


Intel 12700 AVX2:
AMD 6600XT Vulkan:
@ahornby commented on GitHub (Jul 14, 2024):
@Grergo I was able to build ollama with Vulkan, but the output is garbled on my 560X. Feel free to try where I left off from commit `5709e59e10` (branch is https://github.com/ahornby/ollama/tree/macos_amd64_gpu).

@dbl001 commented on GitHub (Jul 14, 2024):
@ahornby In my previous attempt from the master branch ollama was able to detect the GPU:
With your branch (https://github.com/ahornby/ollama/tree/macos_amd64_gpu), ggml_metal_init didn't get as far:
Here's what's changed.
https://github.com/ollama/ollama/compare/main...ahornby:macos_amd64_gpu
@ahornby commented on GitHub (Jul 14, 2024):
@dbl001 try the vulkan mode, that's what's new
@Grergo commented on GitHub (Jul 15, 2024):
@ahornby I built Ollama from commit 5709e9 and it outputs content smoothly without any garbled output.



@ahornby commented on GitHub (Jul 15, 2024):
@Grergo nice! Glad it was useful to someone. I guess my 560X is just too old
@Grergo commented on GitHub (Jul 15, 2024):
@ahornby Thank you for your work. I suspect the issue might be caused by the R560x not supporting Metal 3, but this is just my guess. Currently, there are still performance issues with Vulkan, and the GPU utilization is not high.
@dbl001 commented on GitHub (Jul 16, 2024):
@ahornby Why do I get:
I tried setting: LLAMA_SUPPORTS_GPU_OFFLOAD=on, but it doesn't help.
@ahornby commented on GitHub (Jul 16, 2024):
@dbl001 the log indicates you've not built the vulkan support. The indications being:
Perhaps you had a build error, or didn't run the right build command.
Please read the commit message for the commands to install vulkan libs and build. If no vulkan binary is produced from the build then there is no way to run it at the next step.
@dbl001 commented on GitHub (Jul 17, 2024):
I had to make some changes in llm/generate/gen_darwin.sh to switch from Homebrew to MacPorts. The server log looks reasonable. However, the output is either garbled, null, or, in the case of embeddings, all zeros.
(AI-Feynman) davidlaxer@BlueDiamond-2 ollama % GIN_MODE=debug OLLAMA_LLM_LIBRARY=vulkan ./ollama serve
2024/07/16 19:10:17 routes.go:958: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:vulkan OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/davidlaxer/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]"
time=2024-07-16T19:10:17.641-07:00 level=INFO source=images.go:760 msg="total blobs: 38"
time=2024-07-16T19:10:17.646-07:00 level=INFO source=images.go:767 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
[GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-07-16T19:10:17.647-07:00 level=INFO source=routes.go:1005 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-07-16T19:10:17.649-07:00 level=WARN source=assets.go:100 msg="unable to cleanup stale tmpdir" path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4050472326 error="remove /var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4050472326: directory not empty"
time=2024-07-16T19:10:17.649-07:00 level=WARN source=assets.go:100 msg="unable to cleanup stale tmpdir" path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4115672533 error="remove /var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4115672533: directory not empty"
time=2024-07-16T19:10:17.650-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1471678898/runners
time=2024-07-16T19:10:17.684-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal vulkan]"
time=2024-07-16T19:10:17.708-07:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=vulkan compute="" driver=0.0 name="" total="16.0 GiB" available="16.0 GiB"
[GIN] 2024/07/16 - 19:16:26 | 200 | 61.932µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/07/16 - 19:16:26 | 200 | 20.510101ms | 127.0.0.1 | POST "/api/show"
time=2024-07-16T19:16:26.264-07:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=0 parallel=4 available=17163091968 required="6.3 GiB"
time=2024-07-16T19:16:26.265-07:00 level=INFO source=memory.go:309 msg="offload to vulkan" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[16.0 GiB]" memory.required.full="6.3 GiB" memory.required.partial="6.3 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.3 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-16T19:16:26.265-07:00 level=INFO source=server.go:172 msg="user override" OLLAMA_LLM_LIBRARY=vulkan path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1471678898/runners/vulkan
time=2024-07-16T19:16:26.266-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1471678898/runners/vulkan/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 52304"
time=2024-07-16T19:16:26.269-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-16T19:16:26.269-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-16T19:16:26.269-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3337 commit="a8db2a9c" tid="0x7ff844b27fc0" timestamp=1721182586
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x7ff844b27fc0" timestamp=1721182586 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="52304" tid="0x7ff844b27fc0" timestamp=1721182586
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-07-16T19:16:26.771-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon Pro 5700 XT (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 281.81 MiB
llm_load_tensors: AMD Radeon Pro 5700 XT buffer size = 4155.99 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon Pro 5700 XT KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 2.02 MiB
llama_new_context_with_model: AMD Radeon Pro 5700 XT compute buffer size = 560.00 MiB
llama_new_context_with_model: CPU compute buffer size = 0.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
INFO [main] model loaded | tid="0x7ff844b27fc0" timestamp=1721182602
time=2024-07-16T19:16:42.066-07:00 level=INFO source=server.go:617 msg="llama runner started in 15.80 seconds"
[GIN] 2024/07/16 - 19:16:42 | 200 | 15.831344335s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/07/16 - 19:16:51 | 200 | 2.97464185s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/07/16 - 19:17:05 | 200 | 2.97567865s | 127.0.0.1 | POST "/api/chat"
time=2024-07-16T19:22:10.921-07:00 level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.00131764 model=/Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-16T19:22:11.172-07:00 level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251858415 model=/Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-16T19:22:11.422-07:00 level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501923465 model=/Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
@Gantaronee commented on GitHub (Aug 2, 2024):
Error: llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Metal KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
ggml_gallocr_reserve_n: failed to allocate Metal buffer of size 8891928576
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '/Users/xxxxxx/Downloads/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf'
main: error: unable to load model
Solution (no error):
./llama-cli -m /Users/xxxx/Downloads/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf -n 32 --n-gpu-layers 35 --ctx_size 2048 --batch-size 512
llama_kv_cache_init: Metal KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: Metal compute buffer size = 258.50 MiB
llama_new_context_with_model: CPU compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 512, n_predict = 32, n_keep = 1
def dance_party(dance_moves):
"""
This function takes a list of dance moves and returns a string representing a dance party.
llama_print_timings: load time = 6747.93 ms
llama_print_timings: sample time = 7.48 ms / 32 runs ( 0.23 ms per token, 4280.94 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( nan ms per token, nan tokens per second)
llama_print_timings: eval time = 29722.94 ms / 32 runs ( 928.84 ms per token, 1.08 tokens per second)
llama_print_timings: total time = 29744.97 ms / 32 tokens
Log end
But poor performance. It seems that during the process, the GPU is not being used. I have an AMD Radeon Pro 5600M 8GB.
@Dirrelito071 commented on GitHub (Aug 3, 2024):
Hi, I'm running with an eGPU and it seems like it's making a bad choice by picking my internal GPU. Is there a way to force it to use the Vega 64 instead, using a flag or something? Or to change my default device?
ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon RX Vega 64
ggml_metal_init: found device: Intel(R) UHD Graphics 630
ggml_metal_init: picking default device: Intel(R) UHD Graphics 630
ggml_metal_init: using embedded metal library
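For context, MTLCreateSystemDefaultDevice() follows the GPU driving the display, which is why the integrated UHD 630 wins here. A picker could instead rank devices explicitly; a hypothetical Swift sketch of that idea (ggml's real picker is in ggml-metal.m):

```swift
import Metal

// Rank devices: external (eGPU) first, then discrete, then integrated.
// isRemovable marks eGPUs; isLowPower marks integrated GPUs.
func rank(_ d: MTLDevice) -> Int { d.isRemovable ? 2 : (d.isLowPower ? 0 : 1) }

let device = MTLCopyAllDevices().max { rank($0) < rank($1) }
print("picked: \(device?.name ?? "none")")
```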
@Dirrelito071 commented on GitHub (Aug 4, 2024):
Solved this problem myself, just by connecting my eGPU to a monitor.
This made my Mac declare the default GPU to be my Vega 64, and Ollama followed.
Now my problem became that the output is garbled. I don't know any solution to that; if someone has a possible solution, I'm happy to try it.
@mfoxworthy commented on GitHub (Aug 4, 2024):
I have it "working" on a Macbook Pro i9 AMD Radeon Pro 5500M but it too is slower than the CPU. It used 99% of the CPU with the codellama LLM. Unless I am missing something, I'm just going to give up and just use my M1 for this. That performs extremely well even with 27b LLMs.
INFO [main] build info | build=3337 commit="a8db2a9c" tid="0x7ff84acf1fc0" timestamp=1722784374
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x7ff84acf1fc0" timestamp=1722784374 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="60974" tid="0x7ff84acf1fc0" timestamp=1722784374
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/mfoxworthy/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = codellama
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["&lt;unk&gt;", "&lt;s&gt;", "&lt;/s&gt;", "&lt;0x00&gt;", "&lt;...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1686 MB
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32016
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = codellama
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: PRE token = 32007 '▁<PRE>'
@Dirrelito071 commented on GitHub (Aug 4, 2024):
It seems many people have now gotten their graphics card identified. Making actual use of it is a whole other matter, though. I hope someone knowledgeable can try out some solutions to get more of the GPU's power used, and to get rid of the garbled output.
I'm mostly saying this because a working use case could drive this issue forward toward a more complete solution for everyone.
@akaraon8bit commented on GitHub (Aug 5, 2024):
Hello all, I am running Sonoma:
% sw_vers
ProductName: macOS
ProductVersion: 14.6
BuildVersion: 23G80
I followed different builds from this thread and got:
Ollama call failed with status code 500: llama runner process has terminated: signal: segmentation fault
Built with:
OLLAMA_SKIP_CPU_GENERATE=on OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_F16C=on -DLLAMA_FMA=on -DLLAMA_METAL=on -DLLAMA_METAL_EMBED_LIBRARY=on -DGGML_USE_METAL=on -DLLAMA_METAL_COMPILE_SERIALIZED=1" -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DLLAMA_SUPPORTS_GPU_OFFLOAD=on go generate -v ./...
@dbl001 commented on GitHub (Aug 12, 2024):
I got ollama to run llama 3.1 on my iMac 27" utilizing an AMD Radeon Pro 5700 XT.
I had to modify ggml-metal.m to circumvent a problem where id<MTLDevice> device = MTLCreateSystemDefaultDevice(); always returned nil. I still don't know why.
Running llama 3.1:
Here is my patch:
Server log:
Build commands:
Eval:
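For reference, a rough sketch of a Metal-enabled ollama build of the kind used elsewhere in this thread (flags follow akaraon8bit's comment above, not dbl001's actual commands):
# build ollama with the Metal backend enabled, embedding the metal library
OLLAMA_SKIP_CPU_GENERATE=on \
OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_METAL=on -DLLAMA_METAL_EMBED_LIBRARY=on" \
  go generate ./...
go build .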
@raparici commented on GitHub (Sep 3, 2024):
Is this fix going to main?
@cracksauce commented on GitHub (Sep 17, 2024):
Commenting in support of this feature request, to optimize the Ollama experience for users with AMD GPUs, particularly those using eGPUs on Intel Macs, who have historically been unable to use their graphics hardware for acceleration.
@THL-Leo commented on GitHub (Sep 17, 2024):
@Grergo What flags did you use to generate and build with Vulkan? I followed @ahornby's guide and received empty spaces as output. I am running on an i9 Intel MBP with a 5500M GPU. I was able to run Metal but not Vulkan.
@dbl001 commented on GitHub (Sep 17, 2024):
Same here. Vulkan gives gibberish. Metal works better, but I get GPU errors after the first or second prompt.
iMac 27” 2021 w/AMD Radeon Pro 5700 XT.
@Grergo commented on GitHub (Sep 18, 2024):
Sorry, I haven't followed up on this issue for a while. I used the default parameters to build from this commit: 5709e5. The GPU is an RX 6600 XT; hope this helps you.
@THL-Leo commented on GitHub (Sep 18, 2024):
@dbl001 Mine too. My only guess is that maybe our GPUs don't support a certain version of Vulkan/Metal. Even when I ran with Metal, my performance was far worse than what ahornby reported: he gets 4 tokens per second on an older MacBook model, while I get 1.5-2 tokens per second on the 5500M.
@Grergo I see, thank you for your input. When I ran with the default parameters from the commit, I hit an issue where the static build target doesn't exist. I removed the static flag and it ran. But Vulkan still returns gibberish.
I am not too sure what to mess with in the files since I don't have much experience with ollama or llama.cpp.
@dbl001 commented on GitHub (Sep 18, 2024):
llama3 on Metal:
Sometimes I get these messages about a GPU issue:
@THL-Leo commented on GitHub (Sep 19, 2024):
I get similar results on phi3. Perhaps our GPUs are just too weak/outdated for this.
@duolabmeng6 commented on GitHub (Sep 21, 2024):
How to support AMD Radeon Pro 5500?
@SecuritySura commented on GitHub (Oct 26, 2024):
I can see this thread is older than a year. I also have a Mac with an AMD Radeon Pro 5500M (8 GB) GPU. Does Ollama still not support this GPU, or has anyone found a solution? I'd appreciate your kind response.
@TomDev234 commented on GitHub (Oct 31, 2024):
Can you upload your binaries somewhere?
I have built ollama according to your instructions with the current source code, but the executable does not find any runners.
@aes512 commented on GitHub (Nov 15, 2024):
Same here.
@21307369 commented on GitHub (Nov 19, 2024):
My graphics card is a 6750 GRE 12 GB. How can I make it usable for ollama under macOS?
@TomDev234 commented on GitHub (Nov 19, 2024):
Fix ollama's metal support for Intel Macs.
@alsyundawy commented on GitHub (Dec 17, 2024):
Any update? Because on my 2018 MacBook the RX 560 is not working; it always uses the CPU, not the GPU.
ollama version is 0.5.3
Radeon Pro 560X:
@soerenkampschroer commented on GitHub (Dec 22, 2024):
Llama.cpp now supports my GPU in both Metal and Vulkan (RX 6800). Unfortunately, Metal is still about half as fast as the CPU. Vulkan on the other hand is extremely fast for me. I did a small benchmark with gemma-2-9b-it.Q5_K_M.gguf:
Metal 3.6 t/s
CPU 5.7 t/s
Vulkan 41.3 t/s
The problem is that Vulkan is not stable at all and will descend into gibberish half the time. The longer the prompt, the higher the chance for it to go wrong.
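A comparison like this can be reproduced with llama.cpp's bench tool, roughly as follows (model path assumed; -ngl 0 keeps everything on the CPU):
./build/bin/llama-bench -m gemma-2-9b-it.Q5_K_M.gguf -ngl 99   # offloaded to the GPU backend
./build/bin/llama-bench -m gemma-2-9b-it.Q5_K_M.gguf -ngl 0    # CPU baseline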
@MarcelHeemskerk commented on GitHub (Jan 6, 2025):
Is that using MoltenVK, @soerenkampschroer? I am not aware of native support for Vulkan on macOS.
@soerenkampschroer commented on GitHub (Jan 6, 2025):
@MarcelHeemskerk Yes that is using MoltenVK. Performance is great, but it seems like it's too buggy. The latest build of llama.cpp is not working for me anymore, as the output is now fully corrupted and it's failing a lot more of the backend tests. That's where I gave up.
@FellowTraveler commented on GitHub (Jan 17, 2025):
Did you inform the Llama.cpp team? The bug might be there, rather than in MoltenVK.
@soerenkampschroer commented on GitHub (Jan 17, 2025):
I did, you can find the issue here.
They assume it's a driver/moltenvk bug.
@FellowTraveler commented on GitHub (Jan 18, 2025):
See this: https://github.com/KhronosGroup/MoltenVK/issues/2423#issuecomment-2599817892
@marekk1717 commented on GitHub (Jan 30, 2025):
I've got the same problem on a hackintosh on Sequoia with an AMD GPU, Vulkan, and MoltenVK. It starts generating the answer super fast (compared to the CPU), but then it prints some strange characters.
@dboyan commented on GitHub (Feb 3, 2025):
For people trying to use vulkan backend via MoltenVK but getting garbled output, you may want to try out the change in MoltenVK/2434. This is a PoC fix to address an issue within MoltenVK and llama.cpp now works mostly as expected with it. We will try to push the change to mainline after finalizing the solution.
@marekk1717 commented on GitHub (Feb 3, 2025):
Thanks!
Can you share the steps for compiling it, and for building llama.cpp to properly include Vulkan?
@dboyan commented on GitHub (Feb 3, 2025):
You can follow the MoltenVK instructions here, especially the "Install MoltenVK to Replace the Vulkan SDK libMoltenVK.dylib" part. Just switch the codebase to the PR branch. If you installed MoltenVK from sources other than the LunarG Vulkan SDK (e.g., from homebrew), you'll need to check the path where they actually install their libMoltenVK.dylib and replace the file there (according to https://github.com/KhronosGroup/MoltenVK/issues/2423#issuecomment-2636430442). If simply building llama.cpp, you can just add -DGGML_VULKAN=ON to the cmake arguments as written here. It should work as long as you installed the Vulkan SDK and replaced libMoltenVK.dylib with the custom version.
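Putting that together, a minimal sketch (Homebrew paths assumed; the exact prefix may differ on your machine):
# replace Homebrew's MoltenVK library with the one built from the PR branch
cp ./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib \
   "$(brew --prefix molten-vk)/lib/libMoltenVK.dylib"
# then build llama.cpp with the Vulkan backend
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release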
@marekk1717 commented on GitHub (Feb 4, 2025):
still doesn't work:
What is 1+1?imatelyA9H0>!&:;6<F'<13%E<3'B4G$11%(52,?167?E?<B2'0&4F=:8%FE&A9%9B-!.D:1>HH8,>C&(B-E,#/1=#.#3--7>>67;%":H8"$.C4<<086+=5;<(B-A14&;.?!.7/5.46-8%10)7&-0B91E
@soerenkampschroer commented on GitHub (Feb 4, 2025):
Similar results here. Depending on the model I'm able to get some correct output but then it's garbled again after a couple hundred tokens. When it works it's very fast for me though.
There are also failing tests in test_backend_ops. I can send some logs later today, but the mul_mat tests that I isolated in the other issue still fail.
For reference, I'm on an Intel Mac with an AMD RX 6800.
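For anyone wanting to reproduce those failures, llama.cpp ships a backend test binary; the invocation is roughly as follows (flags per the tool's usage at the time, so adjust as needed):
./build/bin/test-backend-ops test -b Vulkan0             # run all ops on the Vulkan device
./build/bin/test-backend-ops test -b Vulkan0 -o MUL_MAT  # isolate the mul_mat cases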
@dboyan commented on GitHub (Feb 4, 2025):
Interesting, on M1 all test_backend_ops cases are passing. Maybe there is still something not right on older GPUs. Unfortunately I don't have an Intel Mac with an AMD GPU, but if you are able, please tell us about the tests that are still failing, preferably on https://github.com/KhronosGroup/MoltenVK/issues/2423
Also, just curious: which models give you garbled results? For me on M1, the few models I tried worked well most of the time, if not always. There are some models that give garbled output sometimes, and I'm not totally sure what's wrong there. Also, the Metal and Vulkan backends seem similarly fast on M1.
@soerenkampschroer commented on GitHub (Feb 4, 2025):
Yes, Apple Silicon has much better support than Intel. While the Metal backend works on Intel/AMD, it is about half as fast as just running on the CPU. It does run flawlessly though, no corruption at all. The reason why people tried Vulkan/MoltenVK is that the speeds are great, but there is the issue of corrupted output. Maybe it could also be possible to speed up the metal backend, but I understand that soon to be deprecated hardware is not a top priority.
I've been using gemma-2-2b-it.Q8_0.gguf for testing, and it works for a while, but then it's corrupting. Same with Phi-3.1-mini-128k-instruct-Q8_0.
qwen2.5-coder-7b-instruct-q5_k_m corrupts after the first token and just repeats "@@@@".
I'll compile a list of the failing tests and post them on the other issue. I'm failing 383 tests as of the latest builds.
@dboyan commented on GitHub (Feb 4, 2025):
I do see the same corruption patterns sometimes on a few models even on m1 with vulkan. But other models worked flawlessly for very long interactions. I'll try the model you mentioned locally when I have time.
Thanks a lot! Just to make sure, you are using my branch instead of simply the latest main branch right?
If possible, please isolate one failed test case and capture a gputrace as we have done earlier. Although I cannot replay the trace as-is, we can try to inspect the trace for different code paths.
@soerenkampschroer commented on GitHub (Feb 4, 2025):
No problem at all, happy to help where I can! I've been using your branch of MoltenVK and then the latest main branch of llama.cpp. I've also made sure that /usr/local/lib/libMoltenVK.dylib is installed correctly after building your PR. Then I compiled llama.cpp like this:
cmake -B build -DLLAMA_CURL=1 -DGGML_METAL=OFF -DGGML_VULKAN=1 -DVulkan_INCLUDE_DIR=/usr/local/Cellar/vulkan-headers/1.4.307/include -DVulkan_LIBRARY=/usr/local/lib/libvulkan.1.4.307.dylib
I see that you mentioned only using -DGGML_VULKAN=ON; does that make a difference?
@dboyan commented on GitHub (Feb 4, 2025):
I don't think so. It will build with both vulkan and metal backends.
@dboyan commented on GitHub (Feb 5, 2025):
@soerenkampschroer I suspect that your program is still using the original library without my change (or something is going wrong with my heuristic for setting the macro, though I can hardly imagine how). With the gputrace I captured on my own, I can see a few macros have been defined in the metal library, just like this:
But the macro definition is blank within your trace above. (To find it, first find call 203 in your trace by expanding the "vkQueueSubmit" under call 189, and the "vkCmdDispatch" inside. And then double click "Compute Pipeline 0x7f8..." on the right. Expand "Compute Function" > "Compile Option" inside)
@soerenkampschroer commented on GitHub (Feb 7, 2025):
I just wanted to add to this issue that the fix to MoltenVK by @dboyan works great on my machine. I'm now able to use llama.cpp with GPU acceleration.
However, I wasn't able to compile ollama with vulkan support on macOS. There is a pull request to add vulkan support for linux, but I couldn't figure out how to make that work either. It's possible but needs some work.
@marekk1717 commented on GitHub (Feb 7, 2025):
Would it be possible for you to describe step by step how you built everything?
@soerenkampschroer commented on GitHub (Feb 7, 2025):
This is retracing my steps from memory, but it should at least get you on the right track.
Install
Note: The path will be different depending on the version of molten-vk you installed.
Copy ./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib to /usr/local/Cellar/molten-vk/1.2.11/lib/.
Build llama.cpp
Clone the repo as normal and build it with:
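Filling in the gaps, the full sequence presumably looked something like this (the MoltenVK PR branch and the cmake flags are the ones quoted earlier in this thread; version numbers will vary):
# build MoltenVK from the PR branch
git clone https://github.com/KhronosGroup/MoltenVK.git
cd MoltenVK
git fetch origin pull/2434/head:p2434 && git checkout p2434
./fetchDependencies --macos
make macos
# copy the resulting dylib over the Homebrew-installed one
cp ./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib /usr/local/Cellar/molten-vk/1.2.11/lib/
# build llama.cpp against it
cd .. && git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build -DLLAMA_CURL=1 -DGGML_METAL=OFF -DGGML_VULKAN=1
cmake --build build --config Release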
@marekk1717 commented on GitHub (Feb 7, 2025):
Thank you. Please can you also share a llama command line, including the model name and all parameters?
@soerenkampschroer commented on GitHub (Feb 7, 2025):
All the models I've tested so far worked, but as an example:
@marekk1717 commented on GitHub (Feb 7, 2025):
Thank you !!!
Now it works and it's crazy fast :)
@marekk1717 commented on GitHub (Feb 7, 2025):
Next question, any recommendation for good Web UI that can be connected to llama-server? ;)
@THL-Leo commented on GitHub (Feb 7, 2025):
Do you know if this will work with Ollama as well? Or just Llama.cpp.
I was able to get it working on llama.cpp thanks to you :)
@soerenkampschroer commented on GitHub (Feb 7, 2025):
Ollama is built on top of llama.cpp, so yes it would work. The problem is, ollama does not use vulkan in its macOS or linux version, and they don't intend to at the moment. See here.
I'm not familiar enough with the project to say how much work it would be to add vulkan support, but it doesn't look as easy as with other projects.
As for other GUIs:
I was able to build jan.ai with vulkan support, but it uses a custom layer on top of llama.cpp (cortex), and for some reason I got about half the speed of plain llama.cpp. It could probably be improved, but that's as deep as I was willing to dig.
Another way would be to use llama-server and then a web GUI like open-webui through the OpenAI-compatible API (see the sketch at the end of this comment). But there is no model management and it's a bit cumbersome.
Then there is LocalAI which looks promising and should be relatively easy to build with llama.cpp+vulkan. That's next on my list.
As long as the fix is not permanent and merged into MoltenVK I'm hesitant to open issues and ask for vulkan support.
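The llama-server + web UI wiring is roughly the following (ports and the open-webui invocation are assumptions, not something tested in this thread):
# serve a model over the OpenAI-compatible API
./build/bin/llama-server -m ./models/gemma-2-9b-it.Q5_K_M.gguf --n-gpu-layers 99 --port 8080
# point open-webui at it (Docker; host.docker.internal reaches the host from the container)
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
  ghcr.io/open-webui/open-webui:main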
@TomDev234 commented on GitHub (Feb 8, 2025):
Is there a macOS gui available for llama.cpp? I couldn't find any.
@THL-Leo commented on GitHub (Feb 8, 2025):
I think it would be faster to just build a simple one using UI libraries and host it on localhost. That's what I plan to do right now, at least.
@prabhu commented on GitHub (Feb 22, 2025):
https://github.com/KhronosGroup/MoltenVK/pull/2441 got merged.
Step 2 from this comment becomes:
@soerenkampschroer commented on GitHub (Feb 22, 2025):
@prabhu Did you test it on Intel/AMD?
The merged fix does not work for me.
@prabhu commented on GitHub (Feb 22, 2025):
Any errors? I don't see any difference in performance between the two branches.
@soerenkampschroer commented on GitHub (Feb 22, 2025):
Interesting. On my machine, the output is corrupted like before.
Could you tell me what version of macOS and what GPU you're running?
I've been trying to find a fix in the issue over at MoltenVK.
@prabhu commented on GitHub (Feb 22, 2025):
@soerenkampschroer commented on GitHub (Feb 22, 2025):
Thanks for the info!
So it appears that the fix works for 5000 but not 6000 series GPUs. Or there is something wrong with my setup.
Does anyone have a 6000 series GPU and is willing to test the latest master branch of moltenvk?
@jeffklassen commented on GitHub (Feb 24, 2025):
this fix does not work for me:
@tristan-k commented on GitHub (Feb 24, 2025):
I can test with a 6000 series GPU but I'm kinda confused what steps need to be done in order to get there.
@dboyan commented on GitHub (Feb 24, 2025):
We know that the RX 6000 series is not working with the latest main branch, but that is due to another, independent issue (https://github.com/KhronosGroup/MoltenVK/issues/2458). There is a workaround, though. To make it work for now, follow the steps in https://github.com/ollama/ollama/issues/1016#issuecomment-2642713162, but replace step 2 with:
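Per a later comment in this thread, the replacement step is fetching the PR branch before building MoltenVK, roughly:
git fetch origin pull/2434/head:p2434
git checkout p2434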
@rchesnut-amgteam commented on GitHub (Feb 25, 2025):
I followed these steps and get a kernel panic on build every time.
My specs:
i9 intel 3.18Ghz
RX 6800xt
@kanadgodse commented on GitHub (Feb 26, 2025):
~~Even I am facing the same issue.
When I cloned ollama, I found that the directory ml/backend/ggml/ggml/src does not have the directory ggml-vulkan
How to add that?~~
I am an idiot!
I needed to clone https://github.com/ggml-org/llama.cpp and build that. Not Ollama!
@jeffklassen Try what I did above and then it should work.
@soerenkampschroer commented on GitHub (Feb 26, 2025):
@kanadgodse did you test if ollama is using your GPU?
It will compile that way, but I'm pretty sure it will use the CPU instead of the Vulkan backend.
@kanadgodse commented on GitHub (Feb 27, 2025):
Not yet, I am going to download a model and check. Will keep you posted.
@kanadgodse commented on GitHub (Feb 27, 2025):
I got it to run using Vulkan! But the output is not coming out properly.
Here's a screenshot that confirms that it's using the GPU when generating the output:
Here's what I ran and the output:
Now I need to figure out how to get proper output, now that it's blazing fast thanks to the GPU!
Any help will be appreciated!
@soerenkampschroer commented on GitHub (Feb 27, 2025):
Oh I thought you were trying to build ollama. Yes, llama.cpp works that way. You need to compile MoltenVK yourself for it to work though.
https://github.com/ollama/ollama/issues/1016#issuecomment-2677708722
@kanadgodse commented on GitHub (Feb 27, 2025):
Another update!
I had passed --n-gpu-layers 25 as I thought my slow GPU would not be able to handle more, but once I passed --n-gpu-layers 100 I was no longer getting missing text, though the text was still not coherent. Maybe DeepSeek R1 really does not run locally on macOS.
I will try other models, but this link helped me to convert DeepSeek R1 to gguf format:
https://medium.com/@manuelescobar-dev/achieve-state-of-the-art-llm-inference-llama-3-with-llama-cpp-c919eaeaac24
Also, I am using https://chatboxai.app/en to interface with the model running locally via llama-server but still am getting incoherent output:
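For reference, the flag in question on llama-server (model path hypothetical):
./build/bin/llama-server -m deepseek-r1-distill-qwen-7b-q4_k_m.gguf --n-gpu-layers 25    # partial offload: missing text
./build/bin/llama-server -m deepseek-r1-distill-qwen-7b-q4_k_m.gguf --n-gpu-layers 100   # offload everything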
@kanadgodse commented on GitHub (Feb 27, 2025):
Yes, I referenced your comment above (https://github.com/ollama/ollama/issues/1016#issuecomment-2642713162) to compile both MoltenVK and llama.cpp with the Vulkan backend, and then I followed the medium link to try and run DeepSeek R1 locally.
@brendensoares commented on GitHub (Mar 19, 2025):
@dboyan's solution worked for me:
followed by a local build:
Note: I had some issues during the dep fetch step that were resolved by following the error instructions (re: xcodebuild -runFirstLaunch). Once the build process completed, I copied the dylib from ./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib to where I placed it when I built llama.cpp; no new llama.cpp build was required, which is the point/benefit of dynamic libraries. I then ran llama.cpp locally again and got no more garbage/gibberish responses.
Bonus context:
I built llama.cpp with the following config:
@thecannabisapp commented on GitHub (Apr 5, 2025):
Can confirm llama.cpp working on macOS 15.4 running an Intel CPU and an AMD 6800 XT using the steps above. I installed dependencies using brew as per @soerenkampschroer & @rchesnut-amgteam above, then built MoltenVK. Do not install the Vulkan SDK using the installer; I tried that and it was outputting gibberish.
Here's the result of using MoltenVK via local build using instructions above.
@kyvaith commented on GitHub (Jun 13, 2025):
So, the trick of building MoltenVK and llama-cli with the instructions provided in the comments works for me. Works great on a hackintosh with an AMD Radeon RX 6750 XT 12 GB GPU, AMD Ryzen 5 5600 CPU, and 32 GB RAM. Now, how can I build Ollama with it, or wrap it somehow?
@anakayub commented on GitHub (Jul 8, 2025):
I followed the links here and here and managed to build llama.cpp with Vulkan support, with some added steps to fine-tune the server (I used Ollama with AnythingLLM prior to this). Not for pure tech beginners, but I'm seeing 50-400% speed improvements. I wonder if the previous update that was supposed to speed things up using AVX-512 didn't work on macOS (I'm on an iMac Pro 8-core, Vega 56, 64 GB RAM). I personally see yet another reason not to move on to Apple Silicon. People elsewhere talked about activating "flash attention"; I found it to reduce performance. So I basically used default llama.cpp Vulkan settings except for model-specific recommendations (Qwen3-30B-A3B). Before this I had the convenience of multitasking while waiting for the responses; now the waits are gone (probably better for my electric bill, too).
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 |
@brendensoares commented on GitHub (Aug 12, 2025):
@anakayub flash attention can allow for larger context windows, which is very valuable for certain use cases, like local AI coding with tools like aider.
https://chatgpt.com/s/t_689b764cf3d08191bd4340bf840621fb
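For anyone who wants to experiment, flash attention is a runtime flag on llama.cpp (model path hypothetical; the exact flag spelling may vary by version):
./build/bin/llama-server -m ./models/qwen3-30b-a3b-q4_k_m.gguf --n-gpu-layers 99 -fa -c 16384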
@brendensoares commented on GitHub (Aug 12, 2025):
@kyvaith Ollama vendors llama.cpp, so you should be able to clone ollama's git repo, cd into llama/llama.cpp, and do your custom build there. Once that's done you should be able to build ollama, which will use the custom build in the vendored path.
I didn't even think to explore this path myself. I was just using the llama.cpp server that is included, and I recently started using llama-cpp-python from GitHub as a frontend. Being able to use ollama's ecosystem would be ideal.
This may also help: https://chatgpt.com/s/t_689b78929518819191b759803b265241
EDIT: I have recently had to compile a vendored version of llama.cpp for another codebase, and I'll tell you it's important to use the vendored version that is expected by the consuming codebase, e.g. ollama. You can't simply swap in your own local llama.cpp build path if it does not match the expected version.
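A sketch of that vendored-build idea, assuming the paths as of this comment (newer trees vendor under ml/backend/ggml instead, as noted earlier in this thread):
git clone https://github.com/ollama/ollama.git
cd ollama/llama/llama.cpp   # the vendored copy
# apply the same Vulkan flags/patches used for the standalone build here,
# then build ollama from the repo root:
cd ../.. && go generate ./... && go build .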
@mrglutton commented on GitHub (Aug 12, 2025):
UPDATE: I managed to get llama.cpp to work on iMac Pro Vega 64 on Sonoma 15.6 using molten-vk.
In short: some of the build scripts have errors, and the VulkanSDK has errors. It works in the end, on practically all AMD GPUs on Macs. Even on trashcan Macs. :-) (I needed to read this somewhere.)
I will divide the post into two parts:
==== 1 (didn't work, maybe only for newer cards?) ====
I managed to build llama.cpp using the guide here.
Few notes:
This is the build recipe I used:
============================
Now the issue... After building, llama.cpp detects my Vega 64:
But it doesn't use it.
If I force the GPU using the command line, I get a load of cr*p in the terminal.
Using this:
./llama-cli -m ../llm-models/meta-llama-3.1-8b-instruct-q4_0.gguf -ngl 8 -b 512 -ub 128
I get:
@@@@@@@@@@@
==== 2 WORKED! For Vega ====
On that post someone said solution 1 doesn't work and pointed to: here
If you first tried 1, it doesn't work: you will probably get gibberish on older GPUs, although it will probably work on newer GPUs like the 6800.
To "fix", you need:
(/bin/bash-c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)")`brew install cmake git libomp vulkan-headers glslang molten-vk shadercTo start the server you can use this directly:
./build/bin/llama-server --hf-repo hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF --hf-file llama-3.2-1b-instruct-q8_0.gguf -c 2048 --n-gpu-layers 999To run better (3B) model, run:
./build/bin/llama-server --hf-repo hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF --hf-file llama-3.2-3b-instruct-q8_0.gguf -c 2048 --n-gpu-layers 29==== Performance ====
45-50 t/s using Llama-3.2-3B-Instruct-Q8_0-GGUF
110-120 t/s using Llama-3.2-1B-Instruct-Q8_0-GGUF
The 1B model is practically instant: it summarised this paper in one second. Arguably the summary is not that good, but it is good enough. (The CPU is glacial in comparison, and I used that before.) The 3B model is also very fast; it completes the task in roughly 5 seconds.
Here is a benchmark paper: Twin modelling reveals partly distinct genetic pathways to music enjoyment
I fed it PDF directly.
IMG:
============================================================
For me the difference is staggering. It works so fast that I almost can't believe it. I am yet to try faster models.
I included the entire bin directory as an attachment. This is on Sequoia 15.6, and if you install the dependencies it might work for you.
llama.cpp_built_bin.zip
**The file is a ZIP, but there is a 7z file inside; rename it to .7z and unpack. IT IS NOT A VIRUS. LOL, the ZIP was too large.**
@Splash04 commented on GitHub (Aug 24, 2025):
Looks like it's the issue that forces the use of:
git fetch origin pull/2434/head:p2434
I was able to run a model on the GPU using my iMac 27" with an AMD Radeon Pro 5700 XT 16 GB.
@paoloaveri commented on GitHub (Nov 7, 2025):
I was able to run llama.cpp on my Radeon Pro 555X 4 GB (2019 MacBook Pro) using a prebuilt MoltenVK package; instructions here: https://gist.github.com/paoloaveri/31a58a37525b6214ba3ff14fdb90acaf
@jamfor999 commented on GitHub (Nov 14, 2025):
If anyone's curious, I have managed to get this working.
I've pushed a fork of Ollama which works for me using an AMD GPU via MoltenVK. https://github.com/jamfor999/ollama
(thanks to the gist above from @paoloaveri and other people pointing in the right direction)
I probably won't be keeping this up to date much other than just merging in main. Probably good to soak test and get some feedback from peeps here who wanted it - until then I'd rather wait before raising a PR.
@tristan-k commented on GitHub (Nov 15, 2025):
@jamfor999 Thanks. I noticed that brew install go is missing from the dependencies in your fork.
Sadly, Intel iGPUs don't seem to play well with this Vulkan build.
@mrglutton commented on GitHub (Nov 15, 2025):
I don't know what you did wrong, but this is not correct. I have run llama.cpp on various Intel iGPUs across generations, and they all performed well. The performance was irrelevant; I just played with them to see if they would work.
@tristan-k commented on GitHub (Nov 17, 2025):
@mrglutton Sure, I can confirm that llama.cpp is playing nice with the iGPU on macOS, but not with Ollama.
@GoMino commented on GitHub (Nov 18, 2025):
I just tried it and it seems to work correctly at the moment on my MacBook Pro 16" (intel with AMD Radeon Pro 5500M 8GB) @jamfor999 (small model only)
@n-connect commented on GitHub (Nov 26, 2025):
The shortest working steps for building llama.cpp since MoltenVK v1.4 are in this gist: only 3 build parameters for llama.cpp, though it can work with 2 as well (minus the curl one).
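Presumably the three parameters are the ones used throughout this thread, with the curl flag being the optional third (an assumption; check the gist for the exact set):
cmake -B build -DGGML_VULKAN=ON -DGGML_METAL=OFF -DLLAMA_CURL=ON
cmake --build build --config Release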
@raparici commented on GitHub (Dec 7, 2025):
I tried it and it works perfectly for me on a 2019 iMac with an internal Vega 48 and an RX 6900 XT over Thunderbolt, running Sequoia. It's awesome!
@mrglutton commented on GitHub (Dec 7, 2025):
What was your compile procedure and use case?
@raparici commented on GitHub (Dec 7, 2025):
I built llama.cpp with Vulkan support via Molten as @n-connect described:
https://gist.github.com/n-connect/9a7975980f36e187175b0d35e7e52ade
The resulting llama-cli finds the GPUs and splits the layers among them:
./llama-cli -m DeepSeek-R1-Distill-Qwen-14B-Q4_0.gguf -cnv
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro Vega 48 (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
...
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6900 XT) (unknown id) - 16368 MiB free
llama_model_load_from_file_impl: using device Vulkan1 (AMD Radeon Pro Vega 48) (unknown id) - 8176 MiB free
...
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU_Mapped model buffer size = 417,66 MiB
load_tensors: Vulkan0 model buffer size = 4900,16 MiB
load_tensors: Vulkan1 model buffer size = 2824,94 MiB
@1zilc commented on GitHub (Dec 8, 2025):
If anyone needs it, I created a daily-build repo that builds llama.cpp with Vulkan enabled for Intel Macs:
https://github.com/1zilc/llama.cpp-mac_x64-vulkan/releases
@mrglutton commented on GitHub (Dec 8, 2025):
Thank you. :-)
@chafey commented on GitHub (Jan 17, 2026):
Thanks for doing this. It worked with my 2x Vega II Duo system and my 2x W6800X Duo system, but it only detected 3 of 4 GPUs on the 2x Vega II system and only 2 GPUs on the 2x W6800X Duo system. All four GPUs show up in the System Information PCI list. Any suggestions on how to troubleshoot this to get all GPUs active?
@nepomucen-sexp commented on GitHub (Jan 19, 2026):
Thank you @jamfor999! It is working great on my Intel MacBook with AMD Radeon Pro 5500M 8 GB. Merging it to upstream would be awesome.
@bradrlaw commented on GitHub (Jan 24, 2026):
Good stuff @jamfor999, I was able to get this working on a 2019 iMac with 128 GB RAM / 580X 8 GB. It works with VS Code and most tools, but I seem to run into issues using Claude and similar tools. Have you had any luck with those?
Edit: Looks like I am running into this issue:
https://github.com/anthropics/claude-code/issues/20416
@PrAntini commented on GitHub (Feb 3, 2026):
@jamfor999, could you explain more clearly how to build your version? I had this issue:
./scripts/build_darwin_vulkan.sh: line 47: go: command not found
@alifeinbinary commented on GitHub (Feb 3, 2026):
@PrAntini do you have Go installed on your Mac?
https://formulae.brew.sh/formula/go#default
@PrAntini commented on GitHub (Feb 3, 2026):
My bad! I am noobish :) I forgot about Go. I did it! Thank you for your quick response! Big thanks to @alifeinbinary and @jamfor999.
@Deep345 commented on GitHub (Mar 18, 2026):
Hi, I just wanted to ask if there is any update on the status of this request. If we could have AMD GPU support for macOS in mainstream Ollama, I'm sure many users would appreciate the performance gains!
@jamfor999 commented on GitHub (Mar 21, 2026):
I have created a PR to merge my fork upstream, although that does not mean Ollama will necessarily make binaries available for it even if it is merged.