Mirror of https://github.com/ollama/ollama.git (synced 2026-05-06 16:11:34 -05:00)
[GH-ISSUE #15237] New Gemma 4 models run on CPU, they say they are running on GPU (FA Enabled) #35505
Closed · opened 2026-04-22 20:03:45 -05:00 by GiteaMirror · 42 comments
Originally created by @sammyvoncheese on GitHub (Apr 2, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15237
What is the issue?
Update 4/4: The issue is related to the flash attention (FA) feature being turned on.
Models seem to load onto the GPU, then the work jumps to the CPU. ollama ps still shows the model running on the GPU.
I tried the 2b and 4b bf16 models, and the 26b/31b q4 models, on a 5090 with context set to 130k.
Example output from ollama ps:
gemma4:e2b-it-bf16 850bc7fea32f 12 GB 100% GPU 130000 57 minutes from now
From the log:
time=2026-04-02T14:00:19.543-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
Relevant log output
OS
Windows
GPU
Nvidia
CPU
AMD
Ollama version
0.20.0-rc1
@z0n1q commented on GitHub (Apr 2, 2026):
I can confirm that Gemma4 models make very little use of the GPU. It looks like Ollama is offloading some layers to the CPU. May only need some optimization.
OS:
Ubuntu 24.04
GPU:
3x RTX 6000 Pro Blackwell
CPU:
TR 9955WX
@sammyvoncheese commented on GitHub (Apr 2, 2026):
V0.20.0 same issue.
@SingularityMan commented on GitHub (Apr 2, 2026):
Same
@ErikEngerd commented on GitHub (Apr 2, 2026):
Seeing it as well. It ends up using all CPUs on the system.
@craftpip commented on GitHub (Apr 2, 2026):
I see the same thing when using gpt-oss:20b too; the GPU is not used.
I'm trying to run it on an AMD 7900 XTX.
@resc863 commented on GitHub (Apr 2, 2026):
Also the same on my RTX 4080 PC.
No GPU usage with E4B, but the 26B MoE works well on the GPU.
@alerque commented on GitHub (Apr 3, 2026):
Cannot reproduce here; the graphics card takes the load.
I just ran the gemma4:2b, 4b, and 26b models, and all of them showed a small spike on both CPU and GPU at the beginning of processing; thereafter the CPU dropped out and only the GPU stayed loaded until the request completed. Ryzen AI 9 HX 370 w/ Radeon 890M.
@PythonLawrence commented on GitHub (Apr 3, 2026):
Sorta the same. The percentages displayed by ollama (below) are accurate, though. Gemma4 (e2b q4) is not using much of the 8.1GB available on the dedicated RTX 4070 Laptop GPU: ~2.4GB used with 16.4K context, ~2.9GB with 32.8K context, and finally ~4GB with 65.5K context. Interestingly, the a4b model had fewer such issues, with 6.6GB on the GPU at a low context length!
@mazphilip commented on GitHub (Apr 3, 2026):
Env: Ubuntu 24.04, NVIDIA driver 580.126.09, CUDA 13.0, dual 3090 + 5090 (54GB VRAM)
Fix: Flash attention (+ reduced context) in /etc/systemd/system/ollama.service.d/override.conf:
Result: 912 tk/s prompt and 45 tk/s eval via ollama run --verbose, ollama ps reports 100% GPU, 61/61 layers offloaded.
perf shows the flash-attention and dot-product kernels running on the CPU:
54.46% libggml-cpu-haswell.so ggml_compute_forward_flash_attn_ext
39.58% libggml-cpu-haswell.so ggml_vec_dot_f16
- OLLAMA_GPU_OVERHEAD=0 — no change in allocation; 1.2 GiB of weights remained on CPU
- OLLAMA_KV_CACHE_TYPE=q8_0 — collapsed to single GPU, different issues
Unsolved: What are the 1.2 GiB of CPU-side weights and why do flash attention + dot product ops run on CPU despite full layer offload? If anyone has insight, would appreciate it.
Edit: It seems the 1.2GiB are the vision encoder weights that are not offloaded by Ollama/llama.cpp to the GPU. Might be related to #11422
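The contents of mazphilip's drop-in are not mirrored here. An illustrative sketch of that kind of systemd override (the values below are assumptions for illustration, not the poster's actual settings; OLLAMA_FLASH_ATTENTION and OLLAMA_CONTEXT_LENGTH are real Ollama server environment variables):

```ini
# /etc/systemd/system/ollama.service.d/override.conf (illustrative sketch)
[Service]
# Force flash attention on
Environment="OLLAMA_FLASH_ATTENTION=1"
# Reduce the default context so the model fits fully in VRAM
Environment="OLLAMA_CONTEXT_LENGTH=65536"
```

Applied with: sudo systemctl daemon-reload && sudo systemctl restart ollama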
@tjwebb commented on GitHub (Apr 3, 2026):
Same problem: ollama ps reports 100% GPU, but logs show some stuff getting loaded onto the CPU. Eyeballing top and nvtop, it looks like 3/4 of the work is being done by the CPU, and overall performance is much slower than expected; the GPU is only running at ~20% capacity.
CPU: Xeon 6 6747P
GPU: RTX 6000 Pro
@nickkaltner commented on GitHub (Apr 3, 2026):
AMD Ryzen AI Max+ 395 w/ Radeon 8060S here.
I see the same behaviour: as a prompt is evaluated, the GPU usage slowly goes down and the CPU usage goes up. I have tried ROCm and Vulkan, and it's the same thing.
It shows 100% GPU with both gemma4:26b and gemma4:31b, but both of them are definitely using the CPU!
@seawindcn commented on GitHub (Apr 3, 2026):
V0.20.0 same issue.
@somera commented on GitHub (Apr 3, 2026):
Same here ... 50% CPU usage.
Ollama v0.20.0 with an RTX PRO 6000 96GB Server Edition, getting 8-11 tokens/s.
Ubuntu 24.04.x, Nvidia Driver 580.126.20
@rabinnh commented on GitHub (Apr 3, 2026):
I have the same issue. I have 2 Nvidia RTX 3090s and I have conky loaded so I can see the memory of each GPU in real time.
The memory ping-pongs between the 2 GPUs until it finally starts executing on the CPU:
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma4:31b 6316f0629137 63 GB 100% CPU 262144 4 minutes from now
All the other models run on the GPUs fine.
Another issue: when I switch to another model and it's running on the GPUs, Ollama never unloads gemma4:31b, my CPU load is maxed out, the temperatures and fans go way up, and I have to run "sudo systemctl restart ollama" to get everything back to normal.
NAME ID SIZE PROCESSOR CONTEXT UNTIL
richardyoung/kat-dev-72b:iq4_xs 14bbcc414a53 43 GB 100% GPU 8192 4 minutes from now
gemma4:31b 6316f0629137 63 GB 100% CPU 262144 4 minutes from now
@PurpleBanana-ai commented on GitHub (Apr 3, 2026):
My apologies in advance if this is not the appropriate format for posting this info; I generally do not post on these. Running with Open WebUI is worse, especially with any type of tool call like web search, but the case below is Ollama straight in the terminal. FYI, I am seeing the same issue with a GGUF model from unsloth for gemma4, not just the ones downloaded directly from Ollama. My CPU package temperature was touching 60C+, which is not something I see with my cooling except under intense benchmarks, never for inference or even diffusion.
Setup
Debian 13, Cuda 13.2, Driver 595.58.03
i9-14900k 790 chipset
94GB DDR5 6400
m.2 NVME (CPU Side PCIE Bus)
GPU 0: RTX 5090 32GB (CPU Side PCIE Bus - PCIE5 Slot x8)
GPU 1: RTX 3090 24GB (CPU Side PCIE Bus - PCIE 5 Slot x8)-yes the GPU is at PCIE4
GPU 2: RTX 5070ti 16GB (Chipset Side PCIE Bus - PCIE4 16x slot at x4)
GPU 3: RTX 5070ti 16GB (Chipset Side PCIE Bus - m.2 PCIE4 x4 to Occulink EGPU)
(no need to dog the frankenrig, she is fine, this is the only model I am having issues with, I will try it on llama.cpp and vLLM as well later.)
Same issues as above, just a different config, but only with gemma4: any of the model versions, any quant, any ctx size. I am seeing the model weights offloaded to the CPU in the logs below. I have tried creating Modelfiles that statically offload GPU layers (set to 999), but no difference. gemma4 is also running very slowly; even if I pin the 31B at Q4_K_M quant to my 5090 with 8192 ctx, it is no different than across multiple GPUs: ~13-15 tps for gemma4. Load time is as expected across multiple cards with this config, not an issue.
This example and the logs are with the following Modelfile (FA on in the env, no Docker):
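The Modelfile itself is not included in the mirror. A hypothetical one matching the description (a custom gemma4 31B q8_0 build with static GPU layer offloading; num_ctx and num_gpu are real Modelfile parameters, but the exact values and base model tag here are guesses):

```
FROM gemma4:31b-it-q8_0
PARAMETER num_ctx 8192
PARAMETER num_gpu 999
```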
qwen3.5 32B A3B Q8 is fine even at 245,760 ctx.
qwen3-next:80b-a3b-thinking-q4_K_M is fine at 204,800 ctx.
FYI, comparison with the same basic prompt, think mode enabled for gemma4 (turning it off doesn't change the issue):
gemma4
ollama run gemma4-31b-q8_0-custom:latest --verbose
Performance:
total duration: 1m56.786501477s
load duration: 131.067798ms
prompt eval count: 39 token(s)
prompt eval duration: 145.114255ms
prompt eval rate: 268.75 tokens/s
eval count: 1612 token(s)
eval duration: 1m55.901022209s
eval rate: 13.91 tokens/s
qwen3-next:80b-a3b-thinking-q4_K_M
Same Prompt (minus the think tag token) for qwen3-next:80b-a3b-thinking-q4_K_M at 204,800 ctx:
total duration: 55.166186993s
load duration: 81.913138ms
prompt eval count: 33 token(s)
prompt eval duration: 125.400091ms
prompt eval rate: 263.16 tokens/s
eval count: 4901 token(s)
eval duration: 54.000393967s
eval rate: 90.76 tokens/s
Ollama Logs for gemma4:
Not sure it helps, happy to provide more info.
@zestysoft commented on GitHub (Apr 3, 2026):
fwiw, seeing the same behavior on a mac in ollama 0.20:
gemma4:31b
M3 Max processor with 128GB of RAM.
ollama ps shows the model loaded with 100% GPU, but mactop shows 600%+ CPU utilization with very little GPU.
@Wladastic commented on GitHub (Apr 4, 2026):
Hm, weirdly, with version 0.20.2 I ran it inside ollama with a 32k context: no CPU usage, only one thread on my CPU being used, and the answer came quickly.
Then I tested the same 32k context via openclaw, and all 32 CPU cores are running now o.O
@homjay commented on GitHub (Apr 4, 2026):
Based on my observations, the glitch is triggered specifically when sending a second prompt. This behavior is highly unusual.
@sergiosaurio commented on GitHub (Apr 4, 2026):
In my case, using curl or the Python library produces the same results:
~45% CPU usage and ~5% GPU per prompt.
Using the CLI or Ollama app works fine.
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma4:26b 5571076f3d70 21 GB 100% GPU 16000 Forever
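For anyone comparing the API path against the CLI, a minimal sketch of the request body the curl/Python callers would be sending to Ollama's standard POST /api/generate endpoint (the model name and context length mirror the ps output above; the helper function is illustrative, not from the thread):

```python
import json

def build_generate_request(model: str, prompt: str, num_ctx: int = 16000) -> dict:
    # JSON body for Ollama's POST /api/generate endpoint;
    # num_ctx goes under "options", matching the server's runtime parameters.
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # single JSON response instead of a token stream
        "options": {"num_ctx": num_ctx},
    }

body = build_generate_request("gemma4:26b", "Why is the sky blue?")
print(json.dumps(body))
# Send with e.g.: curl http://localhost:11434/api/generate -d '<this JSON>'
```

If the CLI behaves differently from this, comparing the options the CLI sends (visible in the server log) against this payload may narrow down the discrepancy.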
@alerque commented on GitHub (Apr 4, 2026):
My use case involves a separate Rust app that calls the API via the TCP port (via the rig crate). That works fine, and the model runs on the GPU when called via API calls over the socket as well as via the ollama CLI. I don't know what your Python calls would be doing differently than that.
@sammyvoncheese commented on GitHub (Apr 4, 2026):
0.20.2, CPU vs GPU when calling a tool:
gemma4:e4b-it-bf16 d0d10a1b1ddb 21 GB 100% GPU 130000 57 minutes from now
Same model, only generating text:
@somera commented on GitHub (Apr 4, 2026):
Not usable at the moment. v0.20.2
AMD EPYC 9355 32-Core Processor + RTX PRO 6000 96 GB
Mix of CPU and GPU usage:
And very low tokens/s.
Restarted ollama, and then:
and now a longer prompt:
and an even longer prompt:
For the last prompt:
@chenav commented on GitHub (Apr 4, 2026):
+1 on WSL2 (latest version), Docker, and a 5090.
@SingularityMan commented on GitHub (Apr 4, 2026):
Ubuntu 22.04 showing same issues.
@alerque commented on GitHub (Apr 4, 2026):
@somera and others, try the 26b or smaller models instead of 31b. 31b seems to need WILDLY more video RAM than it probably ought to. I have 96G of video memory available too (shared), and it takes north of 60G to run the 26b model. The 31b model starts loading, then runs out of RAM and starts chugging through something on one CPU. I suspect a lot of the issues in this thread have memory issues as their root cause, not CPU/GPU routing.
@SingularityMan commented on GitHub (Apr 4, 2026):
I'm already using the 26b model, and I have 48GB VRAM available. Doesn't matter which context length it is set to.
@alerque commented on GitHub (Apr 4, 2026):
As I just mentioned the 26b model eats through about 60GB VRAM when I run it. Try one of the even smaller models.
@Wladastic commented on GitHub (Apr 4, 2026):
I cannot confirm it to be a RAM issue.
The 31b model with a 22k-token prompt just ran through in about 1-2 seconds.
Once a tool call is mentioned, it reverts to CPU.
@somera commented on GitHub (Apr 4, 2026):
I don't see a VRAM issue on my system. I see mixed CPU+GPU usage.
@mazphilip commented on GitHub (Apr 4, 2026):
I did more digging; this seems to be a flash attention issue with Gemma4 (upstream, in either the flash attention kernels or llama.cpp), somehow only triggered when trying to run coding agents? (ollama launch claude, ollama launch vscode)
You can force FA usage, which makes ollama allocate the memory on the GPU(s), but once you run it (with longer context?), something happens and it moves all the calculation to the CPU.
I get very good performance (1500 tk/s prompt, 50 tk/s eval) when running (using /etc/systemd/system/ollama.service.d/override.conf).
Resolution steps:
@slamj1 commented on GitHub (Apr 4, 2026):
I can confirm @mazphilip finding(s) with respect to FA. Turning off FA in the service seems to solve the issue of CPU offload. Note that the CPU offload only seems to occur when calling via the API. Ollama CLI works fine.
My example usage uses gemma4:31b, with 128K context and takes about 71 GB VRAM. With FA disabled this config works well.
@somera commented on GitHub (Apr 4, 2026):
I'm not using coding agents with ollama, and I have the issues with ollama run <model> and from Open WebUI with a small context (4096).
@sammyvoncheese commented on GitHub (Apr 4, 2026):
I was able to confirm that disabling FA causes the model layers to stay on the GPU now.
@SingularityMan commented on GitHub (Apr 4, 2026):
Can confirm, disabling FA on Ollama seems to correctly offload everything to GPU now.
@viba1 commented on GitHub (Apr 4, 2026):
On my side, disabling FA works correctly for models running 100% GPU, but the issue remains for models that split their workload between the CPU and GPU.
For example:
Gemma4:26b: 21% CPU / 79% GPU ; ~ 1.2 token/s
Gemma3:27b: 19% CPU / 81% GPU ; ~3 token/s
@mazphilip commented on GitHub (Apr 5, 2026):
I managed to make this work by migrating this llama.cpp PR over: https://github.com/ggml-org/llama.cpp/pull/20998
Opening a PR.
@Cephei-OpenSource commented on GitHub (Apr 5, 2026):
I can also confirm: setting OLLAMA_FLASH_ATTENTION=false (or 0, as some suggest; both seem to work) immediately and sharply boosts the performance of Gemma 4 (installed: gemma4:31b). Before: 20 t/s; after: 60 t/s.
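For anyone trying the workaround, disabling flash attention on a systemd install might look like the following (a minimal sketch; treat it as a temporary workaround rather than a fix):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_FLASH_ATTENTION=0"
```

Then: sudo systemctl daemon-reload && sudo systemctl restart ollama. Docker users can pass -e OLLAMA_FLASH_ATTENTION=0 to docker run instead.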
@Hello-World-Traveler commented on GitHub (Apr 6, 2026):
Turning OLLAMA_FLASH_ATTENTION to false makes little difference:
gemma4:e4b 10 GB 66%/34% CPU/GPU 4096 4 minutes from now
With OLLAMA_FLASH_ATTENTION set to 0:
gemma4:e4b 10 GB 66%/34% CPU/GPU 4096 4 minutes from now
gemma3:4b 5.4 GB 100% GPU 4096 4 minutes from now
@tjwebb commented on GitHub (Apr 6, 2026):
yep disabling FA worked for me
@m0n5t3r commented on GitHub (Apr 6, 2026):
Another data point: disabling FA works if you have enough VRAM (in my case a Ryzen AI Max 395, with 64 GB allocated to the GPU). Before, I was seeing between 25% and 75% GPU usage with gemma4:26b and 21 GB of VRAM used; now I see close to 100% GPU use and 38 GB of VRAM used (and it is much faster). ollama ps said 100% GPU in both cases.
@Hello-World-Traveler commented on GitHub (Apr 6, 2026):
Turning off thinking does make it faster, at about 19 t/s, but still 66%/34% CPU/GPU. It doesn't make much difference for me.
I am using Docker with gemma4:e4b and OLLAMA_NEW_ENGINE=true.
@roxlukas commented on GitHub (Apr 7, 2026):
Confirmed: with OLLAMA_FLASH_ATTENTION=1 on Gemma4:26B there is heavy CPU usage (50-60%) and eval token speed hovers around 30 tokens/s, even for the e4b variant!
With OLLAMA_FLASH_ATTENTION=0, token speed on Gemma4:26B jumps to 108 tokens/s on an RTX 3090.
In both cases Ollama reports full GPU inference:
ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma4:26b 5571076f3d70 21 GB 100% GPU 32768 4 minutes from now
env:
Ollama 0.20.3
Windows 11
i5-11400F
64GB DDR4
RTX 3090