[GH-ISSUE #12197] Some requests get processed on CPU, even though model is loaded in GPU (GPT-OSS) #33873
Open · 55 comments
Originally created by @shiraz-shah on GitHub (Sep 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12197
Originally assigned to: @ParthSareen on GitHub.
What is the issue?
The "overlapping GPU/CPU" feature from the last version results in some requests always being processed on the CPU only, even though the model is loaded in GPU. These requests take way longer than they should as a result. Like 10 times longer.
This happens even though the GPU is completely idle. It also happens across multiple platforms and installations of Ollama.
For example, I can have two code editors open. One editor's requests get handled on the GPU, while the other's get handled consistently on the CPU, even when the GPU is idle. If I point the editors at a different server (with a different GPU, CPU and RAM amount) but the same Ollama version, the same editors respectively keep using GPU vs. CPU as with the first server, so it must be something about CPU/GPU routing logic within Ollama, not hardware constraints.
Is there any way to disable the overlapping GPU/CPU capability?
Relevant log output
OS
No response
GPU
No response
CPU
No response
Ollama version
No response
@rick-github commented on GitHub (Sep 6, 2025):
Server logs will help in debugging.
@shiraz-shah commented on GitHub (Sep 6, 2025):
From an 8-core i5 machine with 64 GB of system RAM and an RTX 4090. The multi-minute requests are the CPU ones, whereas the ones that took less than a minute were the GPU ones:
@rick-github commented on GitHub (Sep 6, 2025):
What's the output of ollama ps? Set OLLAMA_DEBUG=1 in the server environment and post the logs with the extra debug information.
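(For anyone following along, a minimal sketch of how to do this, assuming a standard Linux install where the server runs under a systemd unit named "ollama"; adjust for your own setup or platform.)

```sh
# Assumed: a systemd-managed Linux install with a unit named "ollama".
sudo systemctl edit ollama        # add the lines below to the override file
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f           # follow the server log with the extra debug output

# Or, when starting the server by hand:
OLLAMA_DEBUG=1 ollama serve
```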
@shiraz-shah commented on GitHub (Sep 6, 2025):
From a 40-core HP ProLiant with 512 GB of system memory and dual RTX 3060s. The 17-minute request was the one that was handled on the CPU here:
@shiraz-shah commented on GitHub (Sep 6, 2025):
Ollama ps:
Nvidia-smi:
top:
So GPU utilisation is 0% even though VRAM is booked. And CPU utilisation is 700% here
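(For anyone trying to reproduce this comparison, the three views above can be watched together while a request is in flight. These are standard commands, each run in its own terminal; the pgrep pattern is just an assumption about the process names.)

```sh
watch -n 1 nvidia-smi                     # GPU utilisation and VRAM, refreshed every second
top -d 1 -p "$(pgrep -d, ollama)"         # CPU usage of the ollama server and runner processes
ollama ps                                 # what Ollama itself reports as offloaded to the GPU
```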
@shiraz-shah commented on GitHub (Sep 6, 2025):
Server log with debug mode on, while it's doing inference on the CPU and while ollama ps says 100% GPU:
@shiraz-shah commented on GitHub (Sep 6, 2025):
In the meantime, I can run the same model on the CLI with ollama run gpt-oss-long and say "Hey", and that request gets executed on the GPU while the other one is still running on the CPU. The server log then has the following appended to the above:
@shiraz-shah commented on GitHub (Sep 6, 2025):
Next, my code editor times out waiting for the CPU-routed request to finish. The log gets updated with this:
while the code editor looks like this:
@whp-Henry commented on GitHub (Sep 7, 2025):
I encountered similar issues. If you don't need any of the newer features, try downgrading to Ollama v0.11.8 to avoid the "Improved performance via overlapping GPU and CPU computations" change introduced in v0.11.9.
@shiraz-shah commented on GitHub (Sep 7, 2025):
OK, I just downgraded to 0.11.8 as shown below.
I still have the same problem. Some requests end up on CPU, while others get handled by the GPU.
How I downgraded:
@rick-github commented on GitHub (Sep 7, 2025):
"Improved performance via overlapping GPU and CPU computations" doesn't mean that inference is run on different processors. It means that while the GPU is running an inference, the CPU is preparing the state necessary for the next step of the inference. This has always been the case but previously has been serialized (CPU->GPU->CPU->GPU), the change in 0.11.9 was to allow these to overlap so that the GPU is not waiting as long for the CPU to finish preparation.
The logs posted so far show that the model is running 100% on the GPU.
Since the issue is happening on 0.11.8, it's unrelated to the pipeline optimization.
It's not clear if this is the server or the runner. What's the output of
@shiraz-shah commented on GitHub (Sep 7, 2025):
Looks like this when inference is happening on GPU:
But top says:
and nvidia-smi says:
And when inference is happening on CPU, ps says:
But top looks like so:
and nvidia-smi says:
I don't know why there's this difference between what top shows and what ps shows.
But it looks like when the GPU is inferring, the runner uses between 150 and 300% CPU. And when the "CPU" is inferring, the runner uses almost nothing, as does the GPU, while the server averages 750% in top. The ps command from above only shows 75%, though; I don't know why.
So I guess the GPU is inferring in both cases, but in some cases it's the server that's the bottleneck. This is also consistent with "CPU inference" not being memory-heavy.
Wonder if this intermittent server bottlenecking issue is GPT-OSS-specific. And whether it has anything to do with flash attention. Don't remember such issues before flash attention was introduced for GPT-OSS.
@rick-github commented on GitHub (Sep 7, 2025):
There is no inference running on the CPU. The CPU is busy in the server, not the runner. Exactly why the CPU is busy in the server is unclear. What's the output of the following when the server is busy:
@shiraz-shah commented on GitHub (Sep 8, 2025):
@rick-github commented on GitHub (Sep 8, 2025):
Seems like you are running multiple ollama servers. What's the output of
@shiraz-shah commented on GitHub (Sep 8, 2025):
No, I don't think so. Not on purpose anyway.
@shiraz-shah commented on GitHub (Sep 8, 2025):
Under load it looks like this:
@rick-github commented on GitHub (Sep 8, 2025):
My mistake. Try:
@shiraz-shah commented on GitHub (Sep 8, 2025):
I modified your command to make it work better. It was tracing the "grep" command because that had the same string repeated, hence the lack of output before.
remaining output attached
straceOllama.txt
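(The exact command isn't shown above, so purely as a hypothetical reconstruction for anyone who wants to try the same thing: write the trace to a file and filter afterwards, so strace never ends up tracing the grep itself. The PID lookup and the 30-second window are assumptions, not the command actually used in this thread.)

```sh
# Hypothetical sketch, not the exact command from this thread.
OLLAMA_PID="$(pgrep -xo ollama)"                   # oldest process named "ollama", assumed to be the server
sudo timeout 30 strace -f -p "$OLLAMA_PID" -o /tmp/straceOllama.txt
grep -v -e futex -e epoll_pwait /tmp/straceOllama.txt | head -n 50   # filter the noisy wait syscalls afterwards
```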
@shiraz-shah commented on GitHub (Sep 8, 2025):
Here's how it looks when there's no load:
@shiraz-shah commented on GitHub (Sep 8, 2025):
And attached is how it looks under "correct" load (i.e. well-functioning GPU inference without the CPU bottleneck)
straceOllamaGPU.txt
@shiraz-shah commented on GitHub (Sep 9, 2025):
The problem seems to be GPT-OSS-specific.
Maybe it's related to context quantisation. Not sure though.
Can I disable context quantisation for this model without having to downgrade to 0.11.7?
@scotty2 commented on GitHub (Sep 9, 2025):
I'm having the same problem (I think)
gpt-oss-120b, M4 Max 128GB MacBook Pro.
Running codex, ollama sits there using 800-900% CPU, then swaps to GPU for a bit, then back to CPU.
ollama ps shows 100% GPU offloaded, but this is obviously not the case.
It looks to me like it's doing prompt processing on the CPU. Inference itself does seem to actually be happening on the GPU.
ollama 0.11.10, arm64, macOS 15.6.1
@scotty2 commented on GitHub (Sep 9, 2025):
It is between these two log entries that ollama is completely CPU-bound during a request. This gap gets larger and larger with increasing context size, it seems.
After that, the load moves to the GPU, and it returns the completion.
@scotty2 commented on GitHub (Sep 9, 2025):
ollama run details:
@eggshake commented on GitHub (Sep 19, 2025):
I'm having the same problem.
Using gpt-oss and had no issues with 0.11.6
@jessegross commented on GitHub (Sep 19, 2025):
Possibly related to the Harmony parser.
@shiraz-shah commented on GitHub (Sep 20, 2025):
Either that, or maybe the introduction of context quantization.
I experience the problem mostly with agentic workloads. Haven't felt it with chat loads.
And yes, it does seem to scale with context size.
@scotty2 commented on GitHub (Sep 22, 2025):
KV quantization should be disabled on my run (OLLAMA_KV_CACHE_TYPE=)
Harmony parser doesn't seem likely to me since it maxes all of my CPU cores. Smells very much like compute that has been scheduled using a CPU kernel, rather than some kind of serial parsing.
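(If it helps anyone checking the same thing: rather than leaving it unset, the cache type can be pinned explicitly and the memory footprint compared across settings. A sketch, assuming a manually started server and the gpt-oss:20b library tag; note that KV cache quantization only takes effect when flash attention is enabled.)

```sh
# Pin the KV cache to fp16 explicitly instead of relying on the default when unset.
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=f16 ollama serve

# In another terminal: load the model, then compare the reported footprint
# against a run started with OLLAMA_KV_CACHE_TYPE=q8_0.
ollama run gpt-oss:20b "hello"
ollama ps
nvidia-smi --query-gpu=memory.used --format=csv   # on an NVIDIA machine
```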
@shiraz-shah commented on GitHub (Sep 22, 2025):
Are you sure it's disabled though? I feel like GPT-OSS, the way ollama treats it, disregards your context quant settings and does its own thing. Have you checked how your GPU footprint scales with increasing context size?
@scotty2 commented on GitHub (Sep 22, 2025):
I can't be 100% sure- only that if that's unset, it's supposed to use fp16.
I can definitely try to get a gauge of the quantization via memory use.
I do agree that it seems that the prompt processing is being scheduled on the CPU, which is at least at some level related to the KV caches, whether they're quantized or not.
@shiraz-shah commented on GitHub (Sep 22, 2025):
It's just that I don't remember this problem before context quantization was introduced for GPT OSS. But back then I was also running much smaller context windows. So yes, it could definitely be prompt processing rather than context quantization.
@scotty2 commented on GitHub (Sep 22, 2025):
There is definitely something funny going on in KV quantization land.
Honestly, I can't make any sense of those numbers.
@scotty2 commented on GitHub (Sep 22, 2025):
Prompt processing (pre-fill) creates the KV cache, so quantization would (I assume) be hooked into there. That's the compute-heavy workload that I think we're seeing offloaded to our CPUs.
@shiraz-shah commented on GitHub (Sep 22, 2025):
Off topic, but the Ollama docs are not transparent about how context quant is done and how manual settings are handled for different models. For some models it's just disabled no matter what.
And my theory is, that for GPT OSS it defaults to Q4 no matter what you set.
Nice detective work though!!
@scotty2 commented on GitHub (Sep 22, 2025):
The numbers for GPT-OSS definitely support your hypothesis that OLLAMA_KV_CACHE_TYPE is meaningless for GPT-OSS.
@rick-github commented on GitHub (Sep 22, 2025):
https://github.com/ollama/ollama/pull/11929
@scotty2 commented on GitHub (Sep 22, 2025):
@shiraz-shah commented on GitHub (Sep 23, 2025):
Yes, that's exactly how it is on linux as well.
Looks like @rick-github has provided the verdict though. It has to do with GPT OSS's use of attention sinks for KV quant, which can't be run through CUDA as of today. No quick fix for this, I guess.
Before we close this, @scotty2, since you're running this on a Mac, have you tried whether you get the same problem in LM Studio using the MLX version of the model?
@scotty2 commented on GitHub (Sep 23, 2025):
I do not have the same problem on LMS, via llama.cpp backend or MLX.
@scotty2 commented on GitHub (Sep 23, 2025):
I tested with the same workload (codex) pointed at LMS, and it does not exhibit the behavior.
MLX doesn't fully support MXFP4 at the moment, so it's quite broken in other ways, but not this particular way.
The llama.cpp backend though should be an equivalent comparison.
@jessegross commented on GitHub (Sep 23, 2025):
As Rick pointed out, KV cache quantization is disabled for gpt-oss, so the setting has no effect and is not the cause of the issue here. In addition, inference does not run on the Ollama server process.
Looking at these log lines, the time in between them is after a request is received and before it is passed to the runner for inference. The load moves to the GPU because inference is actually happening on the GPU, as reported. Therefore, it must be something else that happens on the server and CPU, such as parsing.
If you post the log with OLLAMA_DEBUG=2 set, we might be able to reproduce the issue. WARNING: This will include user data and will be large.
@scotty2 commented on GitHub (Sep 23, 2025):
It's clear that inference is happening on the GPU. The question was whether or not pre-fill/prompt processing was happening on the main process. If not, then something is parsing with a very, very high degree of parallelism. Which is a cool trick, for sure.
I will be happy to provide an OLLAMA_DEBUG=2 dump.
@scotty2 commented on GitHub (Sep 23, 2025):
Without sanitizing the log yet, it's very obvious where the CPU-bound load is happening.
So, during tokenization, perhaps?
As it is dumping that very long list of tokens, it gets slower and slower.
Next log line (after the multi-line token ID dump) is:
At which point, we're on the GPU (or rather, visually, whatever comes next is indistinguishable on the load graphs).
@nfsecurity commented on GitHub (Sep 24, 2025):
I think I am experiencing the same behavior:
Some inferences (not all; for example, 6 of 30) get stuck and don't respond in normal time. The majority of my 100% GPU inferences take between 1 and 12 seconds, but when one of them gets stuck, it takes 11 minutes to complete. While investigating this behavior I saw excessive CPU usage DURING that specific problematic inference and an IDLE GPU utilization (see the htop image), and my first conclusion was: "that specific inference was processed entirely by the CPU, not the GPU", even though ollama ps shows the whole model loaded 100% on the GPU.
I did several tests and I think I found the cause: this is happening only with large SYSTEM content. Let me explain:
All my prompts are made up of SYSTEM, DEVELOPER and USER instructions. My SYSTEM and DEVELOPER content is around 1024 tokens, and the USER content can be another 2048 tokens. If I reduce the SYSTEM and DEVELOPER length, the excessive CPU consumption in some inferences doesn't happen.
My conclusion at this point is that maybe those large PROMPTS are causing some kind of bottleneck, but not on the first try (all my prompts are large and work well the majority of the time); it feels like something along the lines of a buffer overflow. If I run the same problematic inference in isolation (only that one, manually), it works well, but in a "batch" it's not the same.
My temporary solution has been to reduce the length of the SYSTEM content pending further investigation.
Other very interesting things I have found:
This is not an Ollama-only issue. Inference through Unsloth suffers the same problem with large prompts (some of them get stuck with heavy CPU usage and take 10 minutes to respond instead of 10 seconds). I was able to reproduce the same behavior in Unsloth, and the same solution works (reduce the system and developer length).
I have an NVIDIA RTX 6000 ADA SFF 48GB GPU and I am running GPT-OSS 20B (pulled from Ollama); the problem also occurs when I run my fine-tuned GPT-OSS 20B in GGUF format (MXFP4).
Hope this helps!
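(To make this easier for others to reproduce, a request with a deliberately large system message can be sent straight to the local server. A sketch against the standard /api/chat endpoint; the model tag, the padding text, and its length are stand-ins for the roughly 1024-token SYSTEM/DEVELOPER content described above, not the actual prompts.)

```sh
# Build ~800 words of filler to stand in for a large SYSTEM prompt (hypothetical padding; bash).
PAD="$(printf 'lorem ipsum %.0s' {1..400})"

curl -s http://localhost:11434/api/chat -d "{
  \"model\": \"gpt-oss:20b\",
  \"stream\": false,
  \"messages\": [
    {\"role\": \"system\", \"content\": \"You are a careful assistant. $PAD\"},
    {\"role\": \"user\", \"content\": \"Summarise the system instructions in one sentence.\"}
  ]
}"
```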
@jessegross commented on GitHub (Sep 24, 2025):
@scotty2 Possibly; there is a token counting step as part of the preprocessing on the server. (This is not prompt processing; that still happens on the runner.) However, tokenization also prints out a lot of log lines, so that could cause slowdowns with debug logging.
It would be most helpful if you could share the actual prompts that trigger this so that we can reproduce it, as it is likely dependent on the actual content, as @nfsecurity pointed out. Unfortunately, sanitizing the logs will remove this.
@T1bolus commented on GitHub (Mar 29, 2026):
Can confirm, this is still a huge problem for big prompts of >100k tokens. It's not the preprocessing and it's not the normal inference itself; both run completely on the GPU as intended. I sadly couldn't pinpoint exactly what it is processing, but it's the ollama serve process, and it is not using all cores, just a few.
@shiraz-shah commented on GitHub (Mar 30, 2026):
I've stopped using GPT-OSS for this reason.
@T1bolus commented on GitHub (Mar 30, 2026):
It is not a GPT OSS-specific problem. It also happens with Qwen3.5, Nemotron 3 Super and every other model I have tested.
@shiraz-shah commented on GitHub (Mar 30, 2026):
OK, that's interesting. For me it does not happen with Qwen3.5 27B and Nemotron Cascade 2. Haven't tested Super extensively yet. Also doesn't happen with GLM 4.7 Flash and Qwen 3 Coder 30B. So far I've only really seen it with GPT-OSS.
@T1bolus commented on GitHub (Mar 30, 2026):
Have you tried it with context length above 100k?
@shiraz-shah commented on GitHub (Mar 31, 2026):
This thread is about CPU-bound prompt processing so that's what I thought you meant.
But to answer your question, yes. Prompt processing time does go up with increasing context length for all models. I've tried up to 262,144 tokens. But with the other models it's not CPU-bound. With GPT-OSS some of the prompt processing happens on the CPU, and it's therefore much slower than the other models with large prompts.
It varies between models and hardware, but as a rule of thumb, I see prompt processing speed around 1000 tps, meaning a 100k prompt can take two minutes on the GPU. For GPT-OSS it can take 10 minutes.
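(As a rough back-of-the-envelope check of those figures:)

```sh
echo $((100000 / 1000))   # 100 s, i.e. under two minutes of prefill at ~1000 tok/s
echo $((100000 / 167))    # ~600 s; a 10-minute prefill implies an effective ~167 tok/s
```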
@T1bolus commented on GitHub (Mar 31, 2026):
For me it's CPU-bound as well, but it's much more noticeable with large context sizes. So everything is on the GPU, but for whatever reason the CPU goes wild and does some kind of preprocessing that can take up to a minute. And it's across all models, not just GPT OSS; nearly all models.
This does not happen on vLLM, for example.
@shiraz-shah commented on GitHub (Apr 1, 2026):
I haven't tried vLLM yet.
In my experience, with ollama, a minute of prompt processing for a 100k token query is normal, especially if the GPU is fully engaged. What's strange with GPT-OSS is that the GPU is flat while this happens, and instead the CPU is working with 8 threads or so for several minutes before the GPU becomes active and starts generating tokens.
@eliphatfs commented on GitHub (Apr 9, 2026):
I hit the same issue with gemma 4 31b and the default openclaw system prompt of roughly 80k tokens. I don't know why it is so slow either; it doesn't finish within a minute.