Mirror of https://github.com/ollama/ollama.git (synced 2026-05-06 16:11:34 -05:00)
Closed · opened 2026-04-12 19:50:08 -05:00 by GiteaMirror · 91 comments
Originally created by @nadamas2000 on GitHub (Aug 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11676
What is the issue?
Hello,
I've noticed an issue with GPU utilization. When running the gpt-oss:20b and gpt-oss:120b models, Ollama seems to be running them entirely on the CPU.
My NVIDIA GPUs (RTX 4070-Ti 16GB and RTX 3060 12GB) remain completely idle according to nvidia-smi and Task Manager, while my CPU usage is maxed out. I would expect these models to be loaded onto the GPUs for accelerated performance.
Key Information:
Steps to Reproduce:
Thanks for your great work on this project. Let me know if you need any more information.
server.log
@russellmm commented on GitHub (Aug 5, 2025):
server.log
Can confirm. Same issue for me. Swapped over to qwen3:30b just to be sure and it is using the GPU fine.
@Shawneau commented on GitHub (Aug 5, 2025):
Can confirm not working in Docker on Nvidia gpu, while other models load fine. Host is Ubuntu 22.something
@jessegross commented on GitHub (Aug 5, 2025):
Can you please post the server logs?
@av commented on GitHub (Aug 5, 2025):
@jessegross, sorry for the extra pull logs in the middle.
Similar setup, Ollama v0.11 + Docker, other models use GPU as expected
@hrz6976 commented on GitHub (Aug 5, 2025):
Same here on 4xL40s.
@jessegross commented on GitHub (Aug 5, 2025):
@av @hrz6976
It looks like you both increased OLLAMA_NUM_PARALLEL. I would recommend leaving it at the default setting as higher values use more VRAM and reduce ability to offload.
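For the Docker setups discussed in this thread, a minimal sketch of where that variable usually lives; the compose fragment below is illustrative (the service name, image tag, and the unrelated OLLAMA_HOST entry are assumptions, and GPU/device wiring is omitted):

```yaml
# Illustrative docker-compose.yml fragment.
# Removing or commenting out OLLAMA_NUM_PARALLEL lets Ollama use its default,
# leaving more VRAM for model layers and KV cache.
services:
  ollama:
    image: ollama/ollama:latest      # illustrative tag
    environment:
      # - OLLAMA_NUM_PARALLEL=4      # each extra slot multiplies KV-cache VRAM
      - OLLAMA_HOST=0.0.0.0          # unrelated to this issue, shown only for shape
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
volumes:
  ollama:
```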
@nadamas2000 commented on GitHub (Aug 5, 2025):
Thanks for the suggestion. I've confirmed that I'm using OLLAMA_NUM_PARALLEL=1. I have updated the issue description with the latest logs.
@Shawneau commented on GitHub (Aug 5, 2025):
I have a feeling everyone experiencing this is hitting Ollama via OpenWebUI? Command line Ollama works with 100% GPU, but hitting it with open web ui goes 100% CPU
@av commented on GitHub (Aug 5, 2025):
Understandable!
Setting OLLAMA_NUM_PARALLEL=1, the split is now:
With OLLAMA_NUM_PARALLEL=4, it looks like:
So, possibly something is off with either ps or the estimator, as clearly batching should allocate more memory. In both instances it only uses ~12.9 GB of VRAM, leaving some space unallocated; I hope there's some way to use that and improve the performance a bit.
@jessegross commented on GitHub (Aug 5, 2025):
@nadamas2000 It looks like you increased the context length, this has a similar effect to increasing NUM_PARALLEL. You'll need to use a lower value or the default.
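As a concrete illustration of lowering the context, a hedged sketch follows; the 4096 value simply mirrors the 4k context reported to work below, and the model name is the one under discussion:

```
# Interactive session: lower the context for the current run
ollama run gpt-oss:20b
>>> /set parameter num_ctx 4096

# Or per request through the API options
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "why is the sky blue?",
  "options": { "num_ctx": 4096 }
}'
```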
@nadamas2000 commented on GitHub (Aug 5, 2025):
Ok, in my case, with a 4k context the GPUs run well.
Thanks.
@hrz6976 commented on GitHub (Aug 5, 2025):
Thanks for spotting this! I misunderstood how OLLAMA_NUM_PARALLEL works (related: https://github.com/ollama/ollama/issues/4170). It worked after removing OLLAMA_NUM_PARALLEL from envvars. 😄
P.S. Is there a way for ollama itself to calculate how many requests it can handle before falling back to CPU? I can't find an optimal OLLAMA_NUM_PARALLEL, as it applies to all models and I sometimes need to run different models in parallel.
@HuChundong commented on GitHub (Aug 5, 2025):
I have 4x 2080 Ti 22GB, 88GB in total. gpt-oss:120b shows 10% CPU with an 8k context. Is 88GB not enough for the 120B model?
@russellmm commented on GitHub (Aug 5, 2025):
It was context size for me. On a 5090, I can set the context windows to 32K and the model uses the GPU. Setting to 64K and it switches to the CPU.
@Shawneau commented on GitHub (Aug 5, 2025):
Yeah, that works for me too, but what's the context window for the model though? It's still a bug if it happens over 32K (might not be an Ollama bug though, might be Open WebUI or elsewhere).
@abhinavxd commented on GitHub (Aug 5, 2025):
Yes, it's the context size. It works well with the Ollama UI and CLI (uses GPU).
But when I add this model to GitHub Copilot, the context goes up to 32,768 and it doesn't use the GPU at all.
I got a 4080
@torbwol commented on GitHub (Aug 5, 2025):
It's so weird... With a context of 8192 it utilizes one of my two gpus and says size is 22GB. When increasing the context to 16384 it goes 100% CPU and says size is 13GB. How does this make any sense? Why can't it use both gpus and why can't it use gpus at all when increasing the context size?
@SierraKiloGulf commented on GitHub (Aug 5, 2025):
Same here- team red- 7900XTX. Doesn't matter if using CLI, openWebUI, AnythingLLM and the likes. Windows/Ubuntu
@thedaveCA commented on GitHub (Aug 6, 2025):
As a datapoint:
0.11.0 ran gpt-oss:20b on CPU for me, 0.11.2 on GPU. 7900XTX w/ 24GB VRAM, reporting 14.8GiB in use.
@ZYJZYJZYJ0801 commented on GitHub (Aug 6, 2025):
Same for me.
NAME SIZE PROCESSOR CONTEXT UNTIL
gpt-oss:120b 67 GB 100% CPU 8192 3 minutes from now
How do I fix it?
Other models can use 100% GPU.
@coolbirdzik commented on GitHub (Aug 6, 2025):
Same for me too, with 2x A4000.
@n0k0de commented on GitHub (Aug 6, 2025):
In my setup with a 5060 Ti 16GB (Ollama + Open WebUI all on Docker), Ollama only offloads 22 out of 24 layers to the GPU, even though there are still 3GB of VRAM available.
The OLLAMA_NUM_PARALLEL variable is set to 1.
Additionally, even when I set num_ctx to 4096 via Open WebUI, the context remains at 8192. Hard to say whether this issue comes from Ollama or Open WebUI.
@Ca-rs-on commented on GitHub (Aug 6, 2025):
FWIW I accidentally pulled the wrong Docker image when upgrading to use gpt-oss and it caused this same problem, if you're running NVIDIA don't pull the rocm tag lol.
@ricardofiorani commented on GitHub (Aug 6, 2025):
Same here
@jessegross commented on GitHub (Aug 6, 2025):
There was a bug in 0.11.2 and below where the memory estimation would become too high for gpt-oss if the model needed to be split across GPU and CPU or multiple GPUs. This would often cause 100% CPU usage once the model overflowed a single GPU.
This is fixed in 0.11.3.
@trdischat commented on GitHub (Aug 6, 2025):
Upgrading to 0.11.3 allowed gpt-oss:20b to load at least partially on the GPU. But the memory consumed by the model more than doubled. With 0.11.2, the model used 13GB of memory, 100% on the CPU. With 0.11.3, the model uses 32GB of memory, split 24%/76% between CPU and GPU. This is just running ollama run gpt-oss at the command line.
I am running Ollama on Ubuntu 20.04 with these environment variable settings:
The server has 2 RTX 3060 for a total of 24GB of VRAM and 96GB of system RAM. Reducing the context length to 2000 brought the memory used by the model down to 19GB (running 100% on GPU), still way more than in Ollama 0.11.2.
Testing with other models, including llama, mistral, qwen3, etc., reveals that all models seem to be using more RAM in 0.11.3 than they were in 0.11.2.
@ZYJZYJZYJ0801 commented on GitHub (Aug 7, 2025):
ollama version is 0.11.3
NAME SIZE PROCESSOR CONTEXT UNTIL
gpt-oss:120b 151 GB 37%/63% CPU/GPU 8192 About a minute from now
GPU: 5000Ada *3
memory: 128G *2
It can't use the full GPU. How do I fix it?
@azomDev commented on GitHub (Aug 7, 2025):
Similar issue here #11688
@alienatedsec commented on GitHub (Aug 7, 2025):
And to follow up on the ollama side: it seems like there is still some space left to allocate GPU memory. Some testing as per the below, with OLLAMA_NUM_PARALLEL=1.
@Jonseed commented on GitHub (Aug 7, 2025):
I'm seeing a similar slowdown on Ollama with my 3060 12gb, where I only get about 4 t/s, which is almost unusable. In LM Studio I'm getting up to 13+ t/s, offloading 20 layers out of 24 (83%). When using Ollama, ollama ps shows it is only using 68% of my gpu, and offloading the rest to cpu, which could account for the slowdown.
@azomDev commented on GitHub (Aug 10, 2025):
Sorry for the spam, I was trying to find all similar issues so far to link them here since this is the earliest issue about what seems to be generally the same problem
@jhsmith409 commented on GitHub (Aug 10, 2025):
I had the same issue in #11731 (listed above as well). 5090 + 5070 Ti for a total of 48GB VRAM. It runs 99%+ on CPU and consumes just a small amount of VRAM. When I calculate KV cache size + model + est. overhead, I think it should easily fit... I pulled the model parameters from the model card directly on HF. My context size is large - 128k. Qwen3:30B fits easily with that context and matches my VRAM calculation for it. So I'm thinking there is a bug remaining in the implementation, or there is something different about the model such that the way I'm calculating KV cache size works for Qwen3 but not for GPT-OSS.
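For readers double-checking this kind of estimate, the usual back-of-the-envelope formula is: KV cache bytes ≈ 2 (K and V) × layers × KV heads × head dim × context length × bytes per element. With purely illustrative numbers (not gpt-oss's actual configuration): 2 × 24 layers × 8 KV heads × 64 head dim × 131,072 tokens × 2 bytes (f16) ≈ 6.4 GB, on top of the weights and runtime overhead. Models that apply sliding-window attention on some layers, or quantize the KV cache, need correspondingly less.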
@alienatedsec commented on GitHub (Aug 10, 2025):
Not sure what happened recently - some wrong reporting with the latest version
@rick-github commented on GitHub (Aug 10, 2025):
Most of these issues are because the context is too big. Reduce context, reduce VRAM.
@alienatedsec commented on GitHub (Aug 11, 2025):
Attaching logs if relevant. These are from today, but the same ollama ps output as of yesterday.
_ollama_logs.txt
@rick-github commented on GitHub (Aug 11, 2025):
What is wrong with the reporting?
@alienatedsec commented on GitHub (Aug 11, 2025):
@rick-github the CPU usage
Seems the below could also be related, as I also use OpenWebUI.
@rick-github commented on GitHub (Aug 11, 2025):
The model is loaded 100% in CPU, which is correct.
@alienatedsec commented on GitHub (Aug 11, 2025):
@rick-github it doesn't feel that way, unless I am missing something.
@rick-github commented on GitHub (Aug 11, 2025):
Could you define "feel"?
@alienatedsec commented on GitHub (Aug 11, 2025):
@rick-github could you explain this?
Edit - I don't understand how the model is reported in Ollama as 100% in CPU and at the same time the GPU VRAM (around 80%-90%) is utilised?
@rick-github commented on GitHub (Aug 11, 2025):
What's the output of
@alienatedsec commented on GitHub (Aug 11, 2025):
@rick-github
@rick-github commented on GitHub (Aug 11, 2025):
What's the output of
@alienatedsec commented on GitHub (Aug 11, 2025):
no output
@rick-github commented on GitHub (Aug 11, 2025):
What's the output of
@alienatedsec commented on GitHub (Aug 11, 2025):
Now I understand - which process do you want?
@rick-github commented on GitHub (Aug 11, 2025):
So you are running in a container?
@alienatedsec commented on GitHub (Aug 11, 2025):
yes
@rick-github commented on GitHub (Aug 11, 2025):
What's the output of the following outside of the container
@alienatedsec commented on GitHub (Aug 11, 2025):
Outside the container
@rick-github commented on GitHub (Aug 11, 2025):
Dump the logs from the container and attach.
@rick-github commented on GitHub (Aug 11, 2025):
What does ollama ps show?
@alienatedsec commented on GitHub (Aug 11, 2025):
_ollama_logs.txt
@rick-github commented on GitHub (Aug 11, 2025):
You have modified the model and set num_gpu=256. Originally, ollama estimated that no layers would fit on the GPU given the size of the memory graph, so the output of ollama ps shows the result of that estimation. When it came time for the runner to allocate layers, the override took precedence and caused the runner to allocate all layers to the GPU. It didn't OOM because you have set GGML_CUDA_ENABLE_UNIFIED_MEMORY, which results in the layers overflowing into system RAM. While this prevents an OOM, there is a potential performance hit.
I would be interested to see the statistics (ollama run gpt-oss:120b --verbose 'why is the sky blue?') of this setup versus one where you load an unmodified version of the model and let it run on CPU.
@alienatedsec commented on GitHub (Aug 11, 2025):
vs OpenWebUI
@alienatedsec commented on GitHub (Aug 11, 2025):
Just ran it unmodified on OpenWebUI but left the context size at 128k. Now it is fully loaded on the CPU and the performance is not great.
@rick-github commented on GitHub (Aug 11, 2025):
Which is?
@alienatedsec commented on GitHub (Aug 11, 2025):
Still running. It's about a word per second. I'll update this comment when complete.
I will need more time to provide the output.
@ericcurtin commented on GitHub (Aug 11, 2025):
Ollama seems to have turned off GPU for standard GGUFs from Hugging Face; it makes the llama.cpp version of Ollama only use CPU. My advice: move to something like Docker Model Runner or llama.cpp. I'm willing to assist with Docker Model Runner. The CLI chatbot is just:
docker model run ai/gpt-oss
OpenAI-compatible server is behind:
http://127.0.0.1:12434/engines/llama.cpp/v1
when we turn on TCP in docker model runner.
@rick-github commented on GitHub (Aug 11, 2025):
This is incorrect.
@alienatedsec commented on GitHub (Aug 12, 2025):
ollama run gpt-oss:120b --verbose 'why is the sky blue?' - looks like the context size is 8192.
OpenWebUI - Context 128k - default GPU offloading
@alienatedsec commented on GitHub (Aug 12, 2025):
OpenWebUI - 128k context - max GPU offloading
another go - that's when the model was already loaded
@alienatedsec commented on GitHub (Aug 12, 2025):
@rick-github
Here is another example for llama4:16x17b, which seems to report correctly on the CPU/GPU split.
@alienatedsec commented on GitHub (Aug 12, 2025):
Recently, I have thrown another GPU into my setup and checked with nvidia-smi what the gpt-oss model reports on it. Here is the most relevant output:
GPU 0: 11 461 / 16 376 MiB (≈70 % used)
GPU 1: 15 770 / 20 475 MiB (≈77 % used)
GPU 2: 11 421 / 16 380 MiB (≈70 % used)
GPU 3: 11 461 / 16 376 MiB (≈70 % used)
GPU 4: 11 461 / 16 376 MiB (≈70 % used)
GPU 5: 9 225 / 16 376 MiB (≈56 % used)
Overall: 70 839 / 102 359 MiB (≈69 % used, 31 % free)
@SHU-red commented on GitHub (Aug 14, 2025):
running heavily on CPU
did use ollama create to additionally make a version
Tested all 3 models: official, 32k and 3k
Logs
Logs below were all created with my custom model, with the context window reduced to 32k via a Modelfile: ollama create -f Modelfile gpt-oss:20b_ctx32k
sudo docker logs ollama
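For reference, the Modelfile referred to above was presumably something along these lines; this is a hedged reconstruction, not the poster's actual file:

```
# Hypothetical Modelfile for a 32k-context variant of gpt-oss:20b
FROM gpt-oss:20b
PARAMETER num_ctx 32768
```

followed by the ollama create command shown above.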
@alienatedsec commented on GitHub (Aug 14, 2025):
@SHU-red Just a thought that worked for me and was mentioned by @rick-github
As you already have GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, can you also set --n-gpu-layers 256 instead of the current --n-gpu-layers 9?
@SHU-red commented on GitHub (Aug 14, 2025):
Hi @alienatedsec
How do I do this?
Is there an option for docker-compose?
@alienatedsec commented on GitHub (Aug 14, 2025):
@SHU-red When you run the interactive mode - found here https://github.com/ollama/ollama/issues/1855#issuecomment-1881719430
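For readers skimming the thread, the interactive route looks roughly like this; the saved model name is illustrative and just mirrors the naming used later in the thread:

```
ollama run gpt-oss:20b
>>> /set parameter num_gpu 256
>>> /save gpt-oss:20b_gpu256
```

/save writes the override into a new named model so other services hitting the API can select it; without saving, the setting only applies to the current interactive session.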
@SHU-red commented on GitHub (Aug 14, 2025):
Oh yes! This works! Let me guess: I should have read the above more and there is no way to globally set this and use it with all my other services using ollama via api?
@alienatedsec commented on GitHub (Aug 14, 2025):
https://github.com/ollama/ollama/issues/4850#issuecomment-2176979850
@SHU-red commented on GitHub (Aug 14, 2025):
Sorry, saw this one but do not know what number to set for OLLAMA_MAX_VRAM to have the same effect as --n_gpu_layers 256.
@alienatedsec commented on GitHub (Aug 14, 2025):
Try without it first, unless it fails to load the model.
@SHU-red commented on GitHub (Aug 14, 2025):
OK, not sure what you mean.
I guess this is the VRAM of my GPU in bytes, which is 16GB?
I set it in docker-compose to 10000000, which should be 10GB?
Seems to work! But not sure if this is the solution or if this is still set from your "interactive mode" hint.
Thanks anyway
@alienatedsec commented on GitHub (Aug 14, 2025):
I don't believe you need the OLLAMA_MAX_VRAM variable, as it would likely use whatever is available anyway. You need = to make it a valid env variable: - OLLAMA_MAX_VRAM=13000000
Regardless, what are your stats now?
@SHU-red commented on GitHub (Aug 14, 2025):
OK sorry, I really don't get what you want me to do.
Only CPU working again.
Seems that setting /set parameter num_gpu 256 once is the only thing that does the trick, and it stays active until I completely shut down the container and start it again, and then I would have to re-set it, right?
So after my hard stop-start it's currently not working again.
@rick-github commented on GitHub (Aug 14, 2025):
OLLAMA_MAX_VRAM is no longer supported; it was a short-term workaround that has since been removed.
If you want a model that forces all layers onto the GPU:
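The snippet that originally followed is not preserved in this mirror, but a Modelfile along these lines has the described effect; the num_gpu value just needs to exceed the model's layer count, and the names here are illustrative:

```
# Hypothetical Modelfile: assign every layer to the GPU
FROM gpt-oss:20b
PARAMETER num_gpu 999
```

Create it with something like ollama create gpt-oss:20b-allgpu -f Modelfile. As noted further down in the thread, forcing this without enough VRAM can make the runner crash, so it may only avoid an out-of-memory error when paired with something like GGML_CUDA_ENABLE_UNIFIED_MEMORY.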
@SHU-red commented on GitHub (Aug 14, 2025):
Awesome! Thank you!
@alienatedsec commented on GitHub (Aug 16, 2025):
Good news in relation to the latest v0.11.5-rc2.
Ollama Docker Logs
It's also around 50% quicker - average 19 vs 29 tokens/s.

nvidia-smi output
@Queracus commented on GitHub (Oct 6, 2025):
They solved all this in the latest Ollama. You can crank up the context and it uses the GPU. The problem was with OLLAMA_FLASH_ATTENTION not being supported for gpt-oss in ollama, so even on a 24GB GPU you could only max out around a 32k context before it switched to CPU.
Have to say the 20b model is really smart for such a small one. And fast as F*** on a 3090. It only uses approx. 16GB VRAM at a 256k context.
@SHU-red commented on GitHub (Oct 6, 2025):
I'm not an expert on this but yeah, sometimes it's working.
I got an RTX 4080 with only 15G of memory.
Is there a good setting for the model to run stably (without sometimes failing) and on GPU with my lower available memory?
Right now, sometimes it can't load ...
@jessegross commented on GitHub (Oct 6, 2025):
@SHU-red Can you post the log from a time when it doesn't load?
@SHU-red commented on GitHub (Oct 6, 2025):
Yes sorry, should provide more information:
Prepared models from conversations above:
gpt-oss:20b:
gpt-oss:20b_ctx32k_gpu256:
gpt-oss:20b_gpu256:
@jessegross commented on GitHub (Oct 6, 2025):
@SHU-red In the first case, 84% of the model is loaded on the GPU, so it is using the GPU for most of the model with the remainder spilling onto the CPU. However, the CPU quickly becomes the bottleneck. You have a lot of other things running at the same time (see your nvidia-smi output) so shutting some of them down may free up enough VRAM to get more of the model loaded onto the GPU.
I assume that gpu256 means that you set num_gpu to 256 in an effort to force more to load on the GPU. However, you don't have enough VRAM for this, which is why it crashed. That would seem to indicate that the memory management logic is working correctly and you should stick with the default settings.
@SHU-red commented on GitHub (Oct 6, 2025):
@jessegross
Thank you for that
So, the bottom line is that my graphics card has too little memory, I guess.
Should I leave the docker environment variables as is?
@Queracus commented on GitHub (Oct 7, 2025):
I don't understand why you guys mess with OLLAMA_NUM_PARALLEL=1. I never touched it and it works like a charm.
@SHU-red what happens if you run it through the Ollama GUI with basic settings? A 32k context should fit into 16GB VRAM easily, not even a question, if the max context fits into about 16-17GB for the 20b model.
@SHU-red commented on GitHub (Oct 7, 2025):
@Queracus
I guess it's just me multiplying the memory consumption by running too much in parallel ...
@kiliansinger commented on GitHub (Oct 25, 2025):
I have a similar issue when switching from ollama 0.9.3:
It worked with large models and used the GPU (RTX 4070 8GB), at least offloading some of the data. After upgrading, it uses just the CPU, so models such as Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth:UD-IQ3_XXS become unusable.
From the above comments it is not clear to me how to get it to work with newer ollama versions.
@Queracus commented on GitHub (Oct 25, 2025):
If it doesn't fit in your GPU it will just load into RAM and run on the CPU. That's about it.
@kiliansinger commented on GitHub (Oct 25, 2025):
Yes, indeed the model (13GB) is bigger than 8GB, so it does not fit completely into the GPU. But with 0.9.3 it was working and partially offloading calculations to the GPU, and it was fast enough to be usable. This feature stopped working. LM Studio is actually also able to use the model, but Ollama (0.9.3) was actually 30% more efficient. It would be sad to lose this.
@kiliansinger commented on GitHub (Oct 30, 2025):
I wrote a PR to probably also fix this issue: https://github.com/ollama/ollama/pull/12856