Originally created by @wxletter on GitHub (Jul 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6008
Originally assigned to: @dhiltgen on GitHub.
What is the issue?
When I run "ollama run llama3.1:70b", I can see that 22.9/24 GB of dedicated GPU memory is used, and 18.9/31.9 GB of shared GPU memory is used (it's in Chinese so I did the translation).
From "server.log" I can see "offloaded 42/81 layers to GPU", and when I'm chatting with llama3.1 the response is very slow, "ollama ps" shows:
Memory should be enough to run this model, then why only 42/81 layers are offloaded to GPU, and ollama is still using CPU? Is there a way to force ollama to use GPU? Server log attached, let me know if there's any other info that could be helpful.
OS: Windows 11
GPU: Nvidia RTX 4090
CPU: Intel i7-13700KF
RAM: 64 GB
Ollama version: 0.3.0
server.log (attached)
@wxletter commented on GitHub (Jul 27, 2024):
@mxmp210
Thanks for your reply; however, I don't have that issue running llama3.1:70b. The model loads successfully, but the response is very slow since Ollama is running on the CPU. I have 64 GB RAM.
@wxletter commented on GitHub (Jul 27, 2024):
@dhiltgen could you take a look at this issue? I'm not sure if it's rude to @ you like this; I just saw you're helping people with other problems and think you may be able to help me with this one. Thanks in advance!
@rick-github commented on GitHub (Jul 27, 2024):
ollama is using the GPU: almost all of the dedicated VRAM (21.1 of 24 GB) is being used for the model. But the model is larger than the available dedicated VRAM; it needs 39.3 GB in total, so some of it has to spill into system RAM, which shows up as 18.2 GB of shared GPU memory.
@wxletter commented on GitHub (Jul 27, 2024):
I understand that the dedicated 24 GB of VRAM is not enough to load the model, so shared GPU memory is used. Although "shared GPU memory" is actually RAM, it should be treated as VRAM, just slower than real VRAM. So do you mean that whether it's shared GPU memory or plain RAM, as long as part of the layers are offloaded to RAM, Ollama will use CPU + GPU?
@rick-github commented on GitHub (Jul 27, 2024):
I don't have deep knowledge of Nvidia devices/drivers or how llama.cpp uses them, but generally the problem with RAM-limited peripherals is memory bandwidth. PCIe devices can access system RAM, but at a lower speed than the CPU can. Taking current top-of-the-line tech, a x16 PCIe 4.0 bus has about 32 GB/s simplex transfer rate, while a DDR5-based CPU/RAM system has about 64 GB/s. (GPUs have much higher bandwidth to their local memory due to the increased width of the bus: GDDR6 is typically > 800 GB/s, while HBM is measured in TB/s.) So while it's technically possible for a PCIe-based device to access system RAM, it's usually more efficient to let the CPU process the data in system RAM and the PCIe device process the data in its own RAM.
Somebody with more knowledge of Nvidia cards and llama.cpp could provide more insight.
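As a rough illustration of those bandwidth numbers (a sketch, not from the original thread; the figures are the approximate ones quoted above), here is how long a single full pass over the ~18 GB of spilled layer data would take on each link:

```python
# Rough illustration, using the approximate bandwidths quoted above:
# time to stream the ~18 GB of spilled model layers over each memory link.
BANDWIDTH_GB_S = {
    "PCIe 4.0 x16 (GPU <-> system RAM)": 32,
    "DDR5 (CPU <-> system RAM)": 64,
    "GDDR6 (GPU <-> local VRAM)": 800,
}

SPILLED_GB = 18.2  # the shared GPU memory reported in this issue

for link, bw in BANDWIDTH_GB_S.items():
    print(f"{link}: {SPILLED_GB / bw * 1000:7.1f} ms per full pass")
```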
@wxletter commented on GitHub (Jul 28, 2024):
Thanks very much for the explanation. I don't have this kind of knowledge; I just thought that since it's called "shared GPU memory", if it can only be used the same way as normal RAM, then it's meaningless. It's like "virtual memory": that's just a file on the hard drive, but it's treated as RAM (at least in some cases), not as the hard drive.
If "shared GPU memory" can be recognized as VRAM, even though its speed is lower than real VRAM, Ollama should be able to use 100% GPU to do the job, and then the response should be quicker than using CPU + GPU. I'm not sure if I'm wrong or whether Ollama can do this.
@rick-github commented on GitHub (Jul 28, 2024):
Whether CPU+GPU or GPU-only is faster depends on where the bottleneck is: memory bandwidth or compute. Either way, it's not an ollama issue; it's a llama.cpp issue. Follow up on https://github.com/ggerganov/llama.cpp/issues/6743.
@wxletter commented on GitHub (Jul 28, 2024):
I subscribed to that issue, but it's not the same as mine. On my PC, when Ollama is running llama3.1 70b, both VRAM and shared GPU memory are used; however, most of the time it's the CPU doing the work and the GPU is barely used (judging from the performance monitor). I agree with @alirezanet that even if some layers are offloaded to shared GPU memory, the CPU should not be doing most of the work. In my case it takes more than 1 min before Ollama starts to respond (to a simple chat like just saying "hello"), and I only get about 1 word per second; the performance is too bad.
@rick-github commented on GitHub (Jul 28, 2024):
Stable Diffusion users found shared memory impacted processing speed so much that Nvidia added an option to turn it off. If you have time, it would be interesting to try this and see if anything changes. Having read up a little on shared memory, it's not clear to me why the driver is reporting any shared-memory usage at all: llama.cpp has only 42 layers of the model loaded into VRAM, and if llama.cpp is using the CPU for the other 39 layers, then there should be no shared GPU RAM, just VRAM and system RAM. It would also be interesting if you could post a screen capture of the GPU and CPU usage for the entire time llama.cpp is doing inference, to see whether the load switches completely between GPU and CPU or uses a bit of both at the same time.
@wxletter commented on GitHub (Jul 28, 2024):
I tried setting "prefer no system fallback" for ollama app.exe, ollama.exe, and ollama_llama_server.exe, restarted Ollama, and restarted the PC, with no result... when I run "ollama run llama3.1:70b" it's still using 20+ GB of shared GPU memory. I took a screenshot after I sent "hello" to the model; this is the CPU and GPU usage while llama3.1 is working out how to respond. CPU usage is about 50%, and GPU usage is about 10% all the time; sometimes GPU usage will rise to about 30% and then immediately drop to lower than 10%.
@dhiltgen commented on GitHub (Jul 29, 2024):
As others have pointed out, ollama (and the underlying llama.cpp library) utilize dedicated VRAM on the GPU for inference. Once that memory is near fully allocated, the remaining portions of the model are loaded into system memory and inference is performed using the CPU.
What Windows does with shared memory is run a paging algorithm where pages of memory are swapped back and forth between system RAM and GPU VRAM. While this does allow some apps to overflow VRAM, the performance impact on inference would be significant. It's better to have the CPU perform inference on the portion of the model that doesn't fit within VRAM, in parallel with the GPU processing its portion, instead of thrashing memory pages back and forth.
@wxletter commented on GitHub (Jul 29, 2024):
@dhiltgen Thanks very much for the explanation. I understand now that it's better to have the CPU handle the layers loaded into RAM while the GPU handles the layers loaded into VRAM. I have one concern left: in my case, about half the layers are loaded into RAM and the other half into VRAM. When the GPU and CPU perform inference together, CPU usage is about 40%, and most of the time GPU usage is about 10%; it hardly reaches 30% before dropping back to 10% immediately. Is there anything I can do to make the most of both GPU and CPU and get better inference performance?
@wxletter commented on GitHub (Aug 1, 2024):
Update - today I updated Ollama to version 0.3.2. llama3.1 70B loads faster (about 25 sec) than before (Ollama 0.3.0, more than 1 min), and CPU utilization is higher (about 70%), but GPU utilization is still low (about 20%) during inference. 40/81 layers are loaded into VRAM.
@dhiltgen commented on GitHub (Aug 1, 2024):
It's possible our thread count might not be optimal on your system - see #2496 - you can experiment with setting different values for num_thread to try to optimize performance.
@wxletter commented on GitHub (Aug 2, 2024):
My CPU is an Intel 13700KF with 16 cores and 24 threads. I tried "/set parameter num_thread 24" and "/set parameter num_thread 16", but I only get about 40% CPU usage; I can't even reach the 70% I saw after updating Ollama yesterday, and the GPU usage is still low, about 10% to 20%. I'll try some other numbers to see if it gets better. Is there any parameter I can use to improve GPU utilization?
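For anyone experimenting along these lines: besides the interactive /set command, options like num_thread (CPU threads) and num_gpu (number of layers to offload to the GPU) can also be passed per request through Ollama's REST API. A minimal sketch, assuming a local server on the default port; the values shown are illustrative, not recommendations:

```python
import requests

# Minimal sketch: pass runtime options per request via Ollama's REST API.
# num_thread sets the CPU thread count; num_gpu is the number of layers
# Ollama will try to offload to the GPU.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:70b",
        "prompt": "hello",
        "stream": False,
        "options": {"num_thread": 16, "num_gpu": 42},
    },
)
print(resp.json()["response"])
```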
@xiaohan815 commented on GitHub (Sep 20, 2024):
Maybe you need two 4090 GPUs to run the 70B model without it being sluggish.
@michelle-chou25 commented on GitHub (Oct 29, 2024):
Yes, at least 48 GB of GPU memory.
@kripper commented on GitHub (Nov 16, 2024):
I'm experiencing the same symptom here: the CPU is reporting high load even though only the GPU should be in use (via VRAM and shared memory). Maybe there is a bug in llama.cpp?
@icemagno commented on GitHub (Dec 4, 2024):
I have Ollama for Windows with an RTX 4060, and ollama keeps insisting on using the CPU and RAM instead of my GPU. It is very disappointing because I spent a fortune on this GPU. Many have explained various things about PCIe, buses, RAM performance, etc. So what is the point of having a GPU then?
@rick-github commented on GitHub (Dec 4, 2024):
ollama will use the GPU if it's able to. If you would like your issue debugged, open a new ticket and add server logs.
@DJMo13 commented on GitHub (Jan 24, 2025):
Use smaller models that can actually fit in your GPU's VRAM: quantized 13B models need a GPU with at least 12 GB of VRAM, and 32B models need at least 24 GB. 70B models need datacenter GPUs or two consumer ones. If you put a model that's too big on your GPU, it has to fall back to the CPU as well, and even 1 token per second is fast for a Llama 70B model running on a CPU...
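Those sizing rules follow from a simple back-of-the-envelope formula: model weights take roughly parameter count times bits per weight, plus overhead for the KV cache and runtime buffers. A sketch, assuming ~4.5 bits per weight for q4-class quantization and a 1.2x overhead factor (both assumptions, not measurements):

```python
# Back-of-the-envelope VRAM estimate for a quantized model.
# bits_per_weight ~4.5 approximates q4-class quantization; the 1.2x
# overhead factor for KV cache and buffers is a rough assumption.
def est_vram_gb(params_billions: float, bits_per_weight: float = 4.5,
                overhead: float = 1.2) -> float:
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead

for size in (8, 13, 32, 70):
    print(f"{size:>3}B @ ~4-bit: ~{est_vram_gb(size):.0f} GB")
```

For 70B this lands near the "at least 48 GB" figure quoted earlier in the thread.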
@cyberluke commented on GitHub (Feb 16, 2025):
Just use LM Studio; it loads models better and has many more options for setting up offloading and the number of GPU layers :-/
@99sono commented on GitHub (Apr 21, 2025):
Intuitively, I always thought it would be a no-brainer that the price to pay to swap model data from RAM to GPU VRAM would be much lower than using the CPU for any heavy math like in big language models. With so many more cores on the GPU (like the RTX 3090’s 10,000+ vs. a CPU’s 16 or so), I figured moving data to the GPU would be worth it, even if it’s a bit slow.
Here’s the analysis from Grok 3 on this matter for the Llama 3.1 70B model (q4_0, ~40-50 GB) on an RTX 3090 with 24 GB VRAM, based on your setup (42/81 layers in VRAM, 39 in shared memory):
Summary (TL;DR)
Surprisingly, swapping layers isn't much faster than using the CPU, because moving data over PCIe is so slow. For your RTX 3090, sticking with q4_0 and maybe trying a smaller model like Llama 3.1 8B (~5 GB, fits in VRAM) might be better. You could also try setting OLLAMA_MAX_VRAM to push more layers to VRAM, but be careful of crashes. Thanks for posting this, it's really interesting! What's your CPU, and have you tried smaller models?
Full Grok explanation:
I understand your confusion about the layer-swapping costs and the comparison between swapping layers versus computing on the CPU. The original computation was a bit unclear, and I’ll make it explicit by calculating the time per token for three scenarios involving a 4-bit quantized Llama 3.1 70B model (~40-50 GB) on an RTX 3090 (24 GB VRAM). The scenarios will compare:
I’ll also rewrite your GitHub post in a simpler, non-expert tone, following your requested structure: (A) your intuitive thought about swapping versus CPU computation, and (B) Grok 3’s analysis. The post will reflect the clarified model size (~40-50 GB for q4_0) and address the issue (Ollama Issue #6008).
Clarified Analysis: Time per Token for Three Scenarios
Model and Hardware Assumptions
- Llama 3.1 70B, q4_0 quantization: ~40-50 GB total, ~0.43 GB per layer (81 layers)
- RTX 3090: 24 GB VRAM, ~35.6 TFLOPs/s compute
- x16 PCIe 4.0 link: ~32 GB/s simplex
- CPU: ~1-2 TFLOPs/s
- Current split: 42/81 layers in VRAM, 39 in shared memory/RAM
Scenario 1: Ideal (All Layers in Infinite VRAM)
GPU compute time per token (1-2 TFLOPs of work at ~35.6 TFLOPs/s):
$$\frac{1\text{-}2\,\text{TFLOPs}}{35.6\,\text{TFLOPs/s}} \approx 28\text{-}56\,\text{ms}$$
Scenario 2: Dynamic Swapping (39 Layers Swapped per Token)
Transferring one ~0.43 GB layer over a ~32 GB/s PCIe link:
$$\frac{0.43\,\text{GB}}{32\,\text{GB/s}} \approx 13.4\,\text{ms}$$
Swapping in all 39 layers:
$$39 \times 13.4\,\text{ms} \approx 522.6\,\text{ms}$$
If layers must be swapped both in and out:
$$39 \times 13.4\,\text{ms} \times 2 \approx 1045.2\,\text{ms}$$
For simplicity, assume VRAM can hold the new layers temporarily (e.g., by overwriting the KV cache), so ~522.6 ms for swapping in.
Total per token:
$$522.6\,\text{ms (swapping)} + 50\text{-}100\,\text{ms (compute)} \approx 572.6\text{-}622.6\,\text{ms}$$
Scenario 3: Static Offloading (39 Layers on CPU)
GPU compute for its 42 layers:
$$\frac{0.5\text{-}1\,\text{TFLOPs}}{35.6\,\text{TFLOPs/s}} \approx 14\text{-}28\,\text{ms}$$
CPU compute for the remaining 39 layers:
$$\frac{0.5\text{-}1\,\text{TFLOPs}}{1\text{-}2\,\text{TFLOPs/s}} \approx 250\text{-}1000\,\text{ms}$$
Total per token (using the rounded figures from the summary):
$$25\text{-}50\,\text{ms (GPU)} + 500\text{-}1000\,\text{ms (CPU)} \approx 525\text{-}1050\,\text{ms}$$
Comparison
Key Insight: Dynamic swapping (Scenario 2) is not a bargain compared to static offloading (Scenario 3). Swapping 39 layers costs ~523 ms, which is comparable to or worse than the CPU’s 500-1000 ms for computing 39 layers. The GPU compute advantage (50-100 ms for all layers) is negated by PCIe latency. Static offloading is simpler and often faster, especially if the CPU is reasonably performant.
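For anyone who wants to replay the arithmetic behind the three scenarios, here is a small sketch using the same assumed figures (35.6 TFLOPs/s GPU, 1-2 TFLOPs/s CPU, 32 GB/s PCIe, ~0.43 GB per layer); the totals differ slightly from the rounded numbers quoted above:

```python
# Replay of the three per-token estimates above. All inputs are the
# analysis's assumptions, not measurements.
GPU_TFLOPS = 35.6          # assumed RTX 3090 throughput
CPU_TFLOPS = (1.0, 2.0)    # assumed CPU throughput range
PCIE_GB_S = 32.0           # assumed x16 PCIe 4.0 simplex bandwidth
LAYER_GB = 0.43            # assumed per-layer size (q4_0)

def ms(seconds: float) -> float:
    return seconds * 1000

# Scenario 1: all 81 layers in VRAM, 1-2 TFLOPs of work per token.
s1 = (ms(1.0 / GPU_TFLOPS), ms(2.0 / GPU_TFLOPS))

# Scenario 2: stream 39 layers over PCIe each token, then compute on GPU.
swap_in = ms(39 * LAYER_GB / PCIE_GB_S)
s2 = (swap_in + 50, swap_in + 100)

# Scenario 3: GPU computes its 42 layers, CPU the other 39.
gpu = (ms(0.5 / GPU_TFLOPS), ms(1.0 / GPU_TFLOPS))
cpu = (ms(0.5 / CPU_TFLOPS[1]), ms(1.0 / CPU_TFLOPS[0]))
s3 = (gpu[0] + cpu[0], gpu[1] + cpu[1])

for name, (lo, hi) in [("1: all in VRAM", s1),
                       ("2: dynamic swapping", s2),
                       ("3: static CPU offload", s3)]:
    print(f"Scenario {name}: {lo:.0f}-{hi:.0f} ms per token")
```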