Mirror of https://github.com/ollama/ollama.git (synced 2026-05-06 16:11:34 -05:00)
[GH-ISSUE #4427] ollama can't run qwen:72b, error msg "gpu VRAM usage didn't recover within timeout" #64802
Closed · opened 2026-05-03 18:48:38 -05:00 by GiteaMirror · 29 comments
Originally created by @changingshow on GitHub (May 14, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4427
Originally assigned to: @dhiltgen on GitHub.
What is the issue?
I have already downloaded qwen:7b, but when I run ollama run qwen:7b I get this error: Error: timed out waiting for llama runner to start. In server.log there is this message: gpu VRAM usage didn't recover within timeout
OS
Windows
GPU
Nvidia
CPU
Intel
Ollama version
ollama version is 0.1.37
@chrisoutwright commented on GitHub (May 19, 2024):
I also get it (with smaller parameter counts) when running an RTX 2080 Ti or a GTX 1060 with codeqwen:chat and codegemma:instruct on Win10.
The model stays in RAM but has to be copied to GPU RAM again after every chat POST.
GPU RAM is not being exceeded on either card, so I'm not sure why it times out every time.
@dhiltgen commented on GitHub (May 21, 2024):
This was fixed in 0.1.38 via pr #4430. If you're still seeing problems after upgrading, please share your server log and I'll reopen.
@pamanseau commented on GitHub (May 24, 2024):
@dhiltgen I have the same issue with 0.1.38 with Linux
ollama.log
@dhiltgen commented on GitHub (May 25, 2024):
@pamanseau from the logs you shared, it looks like the client gave up before the model finished loading, and since the client request was canceled, we canceled the loading of the model. Are you using our CLI, or are you calling the API? If you're calling the API, what timeout are you setting in your client?
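For illustration, a client calling the API directly usually has to raise its own HTTP read timeout so that a slow model load is not abandoned midway; a minimal sketch assuming a Python client built on requests (the model name and the 600-second value are arbitrary examples, not recommended settings):

import requests

# Hypothetical client call: allow plenty of time for a large model to load
# before the response arrives (600 s is only an illustration).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen:72b", "prompt": "hello", "stream": False},
    timeout=600,
)
print(resp.json()["response"])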
@kirkster96 commented on GitHub (Jun 1, 2024):
@dhiltgen
I also have the same issue using:
ollama/ollama:latest on Docker
I notice that this occurs after I successfully run a prompt and then let it idle.
When I come back, I have to wait for the "llama runner started in ..." message again.
ollama.log
Edit:
Turns out that I did not specify a keep-alive and the default is 5 minutes!
Never mind! 😁
@pamanseau @dhiltgen
How do I keep a model loaded in memory or make it unload immediately?
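For reference, the keep-alive behaviour mentioned above can be set per request with the keep_alive field of the API payload; a minimal sketch assuming the standard /api/generate endpoint (-1 keeps the model loaded indefinitely, 0 unloads it immediately after the reply; the model name is only an example):

import requests

# keep_alive: -1 keeps the model resident; 0 would unload it right after the reply.
requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": "warm up", "keep_alive": -1, "stream": False},
    timeout=600,
)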
@pamanseau commented on GitHub (Jun 3, 2024):
We don't set a specific timeout, whether using the API or the WebUI.
Perhaps the default timeout is too small.
@axel7083 commented on GitHub (Jun 4, 2024):
I am facing the same issue in version v0.1.41 when trying to set PARAMETER num_gpu 33. Got the following:
@dhiltgen commented on GitHub (Jun 4, 2024):
@axel7083 look earlier in your logs. My suspicion is the GPU runner crashed (perhaps 33 layers was too much for your GPU), we fell back to CPU, and there's a minor bug where we mistakenly try to verify VRAM recovery on unload even though the runner was CPU based.
@dhiltgen commented on GitHub (Jun 4, 2024):
@pamanseau can you try to repro with the CLI to rule out client timeouts?
@abstract-entity commented on GitHub (Jun 5, 2024):
Hello, I get the same here with an RTX 4090 GPU.
@om35 commented on GitHub (Jun 5, 2024):
Hello, same issue on 0.1.41.
ollama run mixtral:8x7b
Error: llama runner process has terminated: exit status 127
@dhiltgen commented on GitHub (Jun 6, 2024):
@om35
/usr/local/cuda/lib64/libcudart.so.11.0: file too short sounds like your local CUDA install might be corrupt.
@abstract-entity it's a little hard to tell, but it seems like the subprocess is crashing. Can you try with debug enabled so we can see a little more detail?
then trigger a model load, and assuming it crashes, share that server.log.
@pamanseau commented on GitHub (Jun 7, 2024):
What I found out is that the NGINX ingress is causing this disconnection from the API, so, as you mentioned, Ollama stopped loading the model and that caused this error.
If I connect the WebUI directly to the ClusterIP or NodePort, it works.
I could configure the NGINX keepalive timeout via the ConfigMap, but that applies to the whole cluster, not just the Ollama ingress, so it has implications for the other services in the cluster.
The Helm chart deploys an Ingress, but it should look at the Gateway API or find another way to keep the client connected.
@ghost commented on GitHub (Jun 16, 2024):
Hi, this doesn't happen to me when running ollama as root directly in a shell, but it happens when I start ollama as a service (regardless of the user):
But somehow:
amnesia λ ~/ sudo ROCR_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION="10.3.0" OLLAMA_DEBUG=1 ollama serve
works fine and I can chat without issue. Here's my service file; please note I have tried with both the ollama user and the root user (and the ollama user is properly configured, i.e. in the render & video groups):
Both in the shell & run as a service they report using the same GPU (id=0, 6700XT):
level=INFO source=amd_linux.go:71 msg="inference compute" id=0 library=rocm compute=gfx1031 driver=0.0 name=1002:73df total="12.0 GiB" available="12.0 GiB"
@dhiltgen commented on GitHub (Jun 18, 2024):
@pulpocaminante the upcoming version 0.1.45 (rc2 currently available) will report the GPU env vars in the log at startup which should help you troubleshoot the settings to figure out which one isn't getting passed in as expected.
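As a sketch of how those variables are typically passed to the systemd service rather than only to an interactive shell, a drop-in override could look like the following (the file path and values simply mirror the shell command above; this is an assumption about the setup, not a verified fix):

# /etc/systemd/system/ollama.service.d/override.conf (hypothetical drop-in)
[Service]
Environment="ROCR_VISIBLE_DEVICES=0"
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
Environment="OLLAMA_DEBUG=1"

# Apply with: sudo systemctl daemon-reload && sudo systemctl restart ollama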
@Tai-Pham-2002 commented on GitHub (Aug 29, 2024):
When I run ollama run aiden_lu/minicpm-v2.6:Q4_K_M
I get this error:
Error: llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed
This is my log:
Aug 29 17:35:21 ai-stg ollama[777926]: time=2024-08-29T17:35:21.555Z level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed"
Aug 29 17:35:21 ai-stg ollama[777926]: [GIN] 2024/08/29 - 17:35:21 | 500 | 1.011699134s | 127.0.0.1 | POST "/api/chat"
Aug 29 17:35:26 ai-stg ollama[777926]: time=2024-08-29T17:35:26.718Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.163553423 model=/usr/share/ollama/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1
Aug 29 17:35:27 ai-stg ollama[777926]: time=2024-08-29T17:35:27.072Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.51679798 model=/usr/share/ollama/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1
Aug 29 17:35:27 ai-stg ollama[777926]: time=2024-08-29T17:35:27.424Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.869100664 model=/usr/share/ollama/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1
help me, please
@Leon-Sander commented on GitHub (Sep 23, 2024):
@dhiltgen
I have no problems on Linux but get this error on Windows. My application uses Ollama as its LLM server, and many users work on Windows and experience this error. I also added "keep_alive": -1 to the API request, but it didn't change the result. I am using the latest Docker image.
ollama-1 | 2024/09/23 10:28:22 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
ollama-1 | time=2024-09-23T10:28:22.189Z level=INFO source=images.go:753 msg="total blobs: 20"
ollama-1 | time=2024-09-23T10:28:22.307Z level=INFO source=images.go:760 msg="total unused blobs removed: 0"
ollama-1 | time=2024-09-23T10:28:22.411Z level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
ollama-1 | time=2024-09-23T10:28:22.416Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12]"
ollama-1 | time=2024-09-23T10:28:22.416Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
ollama-1 | time=2024-09-23T10:28:23.178Z level=INFO source=types.go:107 msg="inference compute" id=GPU-3c819917-39e0-af79-6f9b-db1d227a2872 library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3070" total="8.0 GiB" available="6.9 GiB"
ollama-1 | time=2024-09-23T10:28:56.963Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-3c819917-39e0-af79-6f9b-db1d227a2872 parallel=4 available=7444889600 required="6.2 GiB"
ollama-1 | time=2024-09-23T10:28:56.963Z level=INFO source=server.go:103 msg="system memory" total="7.7 GiB" free="6.4 GiB" free_swap="2.0 GiB"
ollama-1 | time=2024-09-23T10:28:56.965Z level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
ollama-1 | time=2024-09-23T10:28:56.969Z level=INFO source=server.go:388 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 41745"
ollama-1 | time=2024-09-23T10:28:56.970Z level=INFO source=sched.go:449 msg="loaded runners" count=1
ollama-1 | time=2024-09-23T10:28:56.970Z level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
ollama-1 | time=2024-09-23T10:28:56.971Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
ollama-1 | INFO [main] build info | build=10 commit="eaf151c" tid="140164179275776" timestamp=1727087337
ollama-1 | INFO [main] system info | n_threads=6 n_threads_batch=6 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140164179275776" timestamp=1727087337 total_threads=12
ollama-1 | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="41745" tid="140164179275776" timestamp=1727087337
ollama-1 | time=2024-09-23T10:28:57.223Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
ollama-1 | llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
ollama-1 | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1 | llama_model_loader: - kv 0: general.architecture str = llama
ollama-1 | llama_model_loader: - kv 1: general.type str = model
ollama-1 | llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
ollama-1 | llama_model_loader: - kv 3: general.finetune str = Instruct
ollama-1 | llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
ollama-1 | llama_model_loader: - kv 5: general.size_label str = 8B
ollama-1 | llama_model_loader: - kv 6: general.license str = llama3.1
ollama-1 | llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
ollama-1 | llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
ollama-1 | llama_model_loader: - kv 9: llama.block_count u32 = 32
ollama-1 | llama_model_loader: - kv 10: llama.context_length u32 = 131072
ollama-1 | llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
ollama-1 | llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
ollama-1 | llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
ollama-1 | llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
ollama-1 | llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
ollama-1 | llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
ollama-1 | llama_model_loader: - kv 17: general.file_type u32 = 2
ollama-1 | llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
ollama-1 | llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
ollama-1 | llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
ollama-1 | llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
ollama-1 | llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
ollama-1 | llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1 | llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
ollama-1 | llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
ollama-1 | llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
ollama-1 | llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
ollama-1 | llama_model_loader: - kv 28: general.quantization_version u32 = 2
ollama-1 | llama_model_loader: - type f32: 66 tensors
ollama-1 | llama_model_loader: - type q4_0: 225 tensors
ollama-1 | llama_model_loader: - type q6_K: 1 tensors
ollama-1 | llm_load_vocab: special tokens cache size = 256
ollama-1 | llm_load_vocab: token to piece cache size = 0.7999 MB
ollama-1 | llm_load_print_meta: format = GGUF V3 (latest)
ollama-1 | llm_load_print_meta: arch = llama
ollama-1 | llm_load_print_meta: vocab type = BPE
ollama-1 | llm_load_print_meta: n_vocab = 128256
ollama-1 | llm_load_print_meta: n_merges = 280147
ollama-1 | llm_load_print_meta: vocab_only = 0
ollama-1 | llm_load_print_meta: n_ctx_train = 131072
ollama-1 | llm_load_print_meta: n_embd = 4096
ollama-1 | llm_load_print_meta: n_layer = 32
ollama-1 | llm_load_print_meta: n_head = 32
ollama-1 | llm_load_print_meta: n_head_kv = 8
ollama-1 | llm_load_print_meta: n_rot = 128
ollama-1 | llm_load_print_meta: n_swa = 0
ollama-1 | llm_load_print_meta: n_embd_head_k = 128
ollama-1 | llm_load_print_meta: n_embd_head_v = 128
ollama-1 | llm_load_print_meta: n_gqa = 4
ollama-1 | llm_load_print_meta: n_embd_k_gqa = 1024
ollama-1 | llm_load_print_meta: n_embd_v_gqa = 1024
ollama-1 | llm_load_print_meta: f_norm_eps = 0.0e+00
ollama-1 | llm_load_print_meta: f_norm_rms_eps = 1.0e-05
ollama-1 | llm_load_print_meta: f_clamp_kqv = 0.0e+00
ollama-1 | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1 | llm_load_print_meta: f_logit_scale = 0.0e+00
ollama-1 | llm_load_print_meta: n_ff = 14336
ollama-1 | llm_load_print_meta: n_expert = 0
ollama-1 | llm_load_print_meta: n_expert_used = 0
ollama-1 | llm_load_print_meta: causal attn = 1
ollama-1 | llm_load_print_meta: pooling type = 0
ollama-1 | llm_load_print_meta: rope type = 0
ollama-1 | llm_load_print_meta: rope scaling = linear
ollama-1 | llm_load_print_meta: freq_base_train = 500000.0
ollama-1 | llm_load_print_meta: freq_scale_train = 1
ollama-1 | llm_load_print_meta: n_ctx_orig_yarn = 131072
ollama-1 | llm_load_print_meta: rope_finetuned = unknown
ollama-1 | llm_load_print_meta: ssm_d_conv = 0
ollama-1 | llm_load_print_meta: ssm_d_inner = 0
ollama-1 | llm_load_print_meta: ssm_d_state = 0
ollama-1 | llm_load_print_meta: ssm_dt_rank = 0
ollama-1 | llm_load_print_meta: ssm_dt_b_c_rms = 0
ollama-1 | llm_load_print_meta: model type = 8B
ollama-1 | llm_load_print_meta: model ftype = Q4_0
ollama-1 | llm_load_print_meta: model params = 8.03 B
ollama-1 | llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
ollama-1 | llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
ollama-1 | llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
ollama-1 | llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: LF token = 128 'Ä'
ollama-1 | llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ollama-1 | llm_load_print_meta: max token length = 256
ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ollama-1 | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama-1 | ggml_cuda_init: found 1 CUDA devices:
ollama-1 | Device 0: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes
ollama-1 | llm_load_tensors: ggml ctx size = 0.27 MiB
ollama-1 | time=2024-09-23T10:33:57.075Z level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.00 - "
ollama-1 | [GIN] 2024/09/23 - 10:33:57 | 500 | 5m0s | 172.18.0.3 | POST "/api/chat"
app-1 | {'error': 'timed out waiting for llama runner to start - progress 0.00 - '}
ollama-1 | time=2024-09-23T10:34:02.206Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.131559562 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
ollama-1 | time=2024-09-23T10:34:02.455Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.380732173 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
ollama-1 | time=2024-09-23T10:34:02.716Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.641832884 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
@dhiltgen commented on GitHub (Sep 24, 2024):
@Leon-Sander you can set OLLAMA_LOAD_TIMEOUT to adjust the timeout if your system needs more than 5m to load the model. If that doesn't help get you past it, please open a new issue with your server logs so we can investigate.
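For the Docker setup described above, that variable would be passed as an environment override when starting the container; a minimal sketch (the 10m value is an arbitrary example, and the remaining flags follow the usual Ollama Docker invocation):

docker run -d --gpus=all -e OLLAMA_LOAD_TIMEOUT=10m \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama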
@Leon-Sander commented on GitHub (Sep 27, 2024):
@dhiltgen got it, thanks.
Do you have an idea why loading/offloading to the GPU takes that much time on Windows? On Linux, llama3.1 loads in 10 seconds, but on Windows it takes 5 minutes on the same computer. I have good hardware.
Inference time seems to be pretty much the same as on Linux; it's just the model loading that is unbearable.
Edit: The answer seems to be the WSL2-based Docker image: I/O back to the NTFS filesystem can be slow #6006.
I just tested it without Docker and the loading is as fast as on Linux.
@JoffreyLemeryAncileo commented on GitHub (Nov 20, 2024):
@dhiltgen, thanks for all your activity!
For my company I am doing some testing to decide which Azure instance type we are going to deploy our model on.
I installed Ollama version 0.4.1 on Ubuntu 22.04 (NC64asT4v3 series).
I installed CUDA 12.6 and I have all the GPUs available.
As many others have, I updated my ollama.service like this to make it accessible via API calls:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="OLLAMA_LOG_LEVEL=debug"
Environment="OLLAMA_GPU=1"
[Install]
WantedBy=default.target
When I run gemma2:27b, for example, it is quite slow despite the 4 GPUs.
I dug a bit and saw that just 4/47 layers are offloaded; CPU usage explodes and the GPUs are almost unused.
==> I duplicated the model with a Modelfile and controlled the offloading with: PARAMETER num_gpu 47
I also set it inside gemma2:27b using /set parameter num_gpu 47
That increased performance a bit. I now want to validate with API calls from a local laptop. However, no matter which model I call or which parameters I set, the offloading is always only 4/47.
I understand from another ticket you replied to that Ollama aims to calculate the VRAM itself, which is strange because with the 4 GPUs I theoretically have enough VRAM to load and run all the layers.
So I wonder whether the calculation really works and whether there is a way to bypass it using a Python request payload.
One point I noticed: the logs at the beginning always tell me that the GPU VRAM didn't recover.
LOGS:
Nov 20 10:17:57 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:17:57.777Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=6.381064304 model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc Nov 20 10:17:59 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:17:59.348Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=7.952129341 model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc Nov 20 10:18:00 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:00.890Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=9.494299231 model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc Nov 20 10:18:02 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:02.431Z level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc library=cuda parallel=1 required="46.3 GiB" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.039Z level=INFO source=server.go:105 msg="system memory" total="432.9 GiB" free="427.0 GiB" free_swap="0 B" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.040Z level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=4 layers.model=47 layers.offload=4 layers.split=1,1,1,1 memory.available="[14.5 GiB 14.5 GiB 14.5 GiB 14.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="99.8 GiB" memory.required.partial="46.3 GiB" memory.required.kv="44.9 GiB" memory.required.allocations="[11.6 GiB 11.6 GiB 11.6 GiB 11.6 GiB]" memory.weights.total="58.6 GiB" memory.weights.repeating="57.7 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="8.6 GiB" memory.graph.partial="8.6 GiB" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.042Z level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3490198133/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 128000 --batch-size 512 --n-gpu-layers 4 --threads 64 --parallel 1 --tensor-split 1,1,1,1 --port 45325" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.043Z level=INFO source=sched.go:449 msg="loaded runners" count=1 Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.043Z level=INFO source=server.go:562 msg="waiting for llama runner to start responding" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.044Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error" Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: time=2024-11-20T10:18:04.069Z level=INFO source=runner.go:863 msg="starting go runner"[...]
Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: ggml_cuda_init: found 4 CUDA devices:
Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 0: Tesla T4, compute capability 7.5, VMM: yes
Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 1: Tesla T4, compute capability 7.5, VMM: yes
Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 2: Tesla T4, compute capability 7.5, VMM: yes
Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: Device 3: Tesla T4, compute capability 7.5, VMM: yes
Nov 20 10:18:04 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: ggml ctx size = 1.14 MiB
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: offloading 4 repeating layers to GPU
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: offloaded 4/47 layers to GPU
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CPU buffer size = 14898.60 MiB
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA0 buffer size = 303.82 MiB
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA1 buffer size = 303.82 MiB
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA2 buffer size = 303.82 MiB
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llm_load_tensors: CUDA3 buffer size = 303.82 MiB
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: n_ctx = 128000
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: n_batch = 512
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: n_ubatch = 512
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: flash_attn = 0
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: freq_base = 10000.0
Nov 20 10:18:07 NC64asT4v3Ubuntu2204 ollama[337512]: llama_new_context_with_model: freq_scale = 1
Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA_Host KV buffer size = 42000.00 MiB
Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA0 KV buffer size = 1000.00 MiB
Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA1 KV buffer size = 1000.00 MiB
Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA2 KV buffer size = 1000.00 MiB
Nov 20 10:18:43 NC64asT4v3Ubuntu2204 ollama[337512]: llama_kv_cache_init: CUDA3 KV buffer size = 1000.00 MiB
Is my GPU config wrong and leaving VRAM unreleased, or is 4/47 offloaded layers really the maximum, or is the automatic num_gpu calculation wrong? Or maybe all of them?
Cheers and thanks !
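For what it's worth, num_gpu can also be passed per request in the options object of the API payload, which is the usual way to override it from a Python client; a minimal sketch (the model name and layer count just mirror the Modelfile experiment above and are examples only):

import requests

# Hypothetical request: ask the server to offload 47 layers for this call only.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma2:27b",
        "prompt": "hello",
        "stream": False,
        "options": {"num_gpu": 47},
    },
    timeout=600,
)
print(resp.json().get("response"))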
@yanzhenxu99 commented on GitHub (Feb 10, 2025):
same error inside docker, log:
time=2025-02-10T09:53:08.738+08:00 level=INFO source=server.go:104 msg="system memory" total="1007.5 GiB" free="956.4 GiB" free_swap="0 B" time=2025-02-10T09:53:08.739+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=50 layers.model=62 layers.offload=37 layers.split=19,18 memory.available="[78.9 GiB 78.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="244.4 GiB" memory.required.partial="153.6 GiB" memory.required.kv="95.3 GiB" memory.required.allocations="[77.7 GiB 75.8 GiB]" memory.weights.total="224.7 GiB" memory.weights.repeating="224.0 GiB" memory.weights.nonrepeating="725.0 MiB" memory.graph.full="7.1 GiB" memory.graph.partial="7.1 GiB" time=2025-02-10T09:53:08.741+08:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v11_avx/ollama_llama_server runner --model /nfs-userfs/xuyanzhen/deepseek/blobs/sha256-a542caee8df72af41ad48d75b94adacb5fbc61856930460bd599d835400fb3b6 --ctx-size 20480 --batch-size 512 --n-gpu-layers 50 --threads 127 --parallel 5 --tensor-split 19,18 --port 57150" time=2025-02-10T09:53:08.743+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2025-02-10T09:53:08.743+08:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding" time=2025-02-10T09:53:08.744+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error" time=2025-02-10T09:53:08.758+08:00 level=INFO source=runner.go:936 msg="starting go runner" ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA A800-SXM4-80GB, compute capability 8.0, VMM: yes Device 1: NVIDIA A800-SXM4-80GB, compute capability 8.0, VMM: yes time=2025-02-10T09:53:08.792+08:00 level=INFO source=runner.go:937 msg=system info="CUDA : USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=127 time=2025-02-10T09:53:08.793+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:57150" time=2025-02-10T09:53:08.995+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" llama_load_model_from_file: using device CUDA0 (NVIDIA A800-SXM4-80GB) - 80727 MiB free llama_load_model_from_file: using device CUDA1 (NVIDIA A800-SXM4-80GB) - 80727 MiB free llama_model_loader: loaded meta data with 52 key-value pairs and 1025 tensors from /nfs-userfs/xuyanzhen/deepseek/blobs/sha256-a542caee8df72af41ad48d75b94adacb5fbc61856930460bd599d835400fb3b6 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = deepseek2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16 llama_model_loader: - kv 3: general.quantized_by str = Unsloth llama_model_loader: - kv 4: general.size_label str = 256x20B llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 6: deepseek2.block_count u32 = 61 llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840 llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168 llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432 llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128 llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128 llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8 llama_model_loader: - kv 15: deepseek2.leading_dense_block_count u32 = 3 llama_model_loader: - kv 16: deepseek2.vocab_size u32 = 129280 llama_model_loader: - kv 17: deepseek2.attention.q_lora_rank u32 = 1536 llama_model_loader: - kv 18: deepseek2.attention.kv_lora_rank u32 = 512 llama_model_loader: - kv 19: deepseek2.attention.key_length u32 = 192 llama_model_loader: - kv 20: deepseek2.attention.value_length u32 = 128 llama_model_loader: - kv 21: deepseek2.expert_feed_forward_length u32 = 2048 llama_model_loader: - kv 22: deepseek2.expert_count u32 = 256 llama_model_loader: - kv 23: deepseek2.expert_shared_count u32 = 1 llama_model_loader: - kv 24: deepseek2.expert_weights_scale f32 = 2.500000 llama_model_loader: - kv 25: deepseek2.expert_weights_norm bool = true llama_model_loader: - kv 26: deepseek2.expert_gating_func u32 = 2 llama_model_loader: - kv 27: deepseek2.rope.dimension_count u32 = 64 llama_model_loader: - kv 28: deepseek2.rope.scaling.type str = yarn llama_model_loader: - kv 29: deepseek2.rope.scaling.factor f32 = 40.000000 llama_model_loader: - kv 30: deepseek2.rope.scaling.original_context_length u32 = 4096 llama_model_loader: - kv 31: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000 llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 33: tokenizer.ggml.pre str = deepseek-v3 llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�... llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e... llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 0 llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 128815 llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 41: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 42: tokenizer.chat_template str = {% if not add_generation_prompt is de... 
llama_model_loader: - kv 43: general.quantization_version u32 = 2 llama_model_loader: - kv 44: general.file_type u32 = 24 llama_model_loader: - kv 45: quantize.imatrix.file str = DeepSeek-R1.imatrix llama_model_loader: - kv 46: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt llama_model_loader: - kv 47: quantize.imatrix.entries_count i32 = 720 llama_model_loader: - kv 48: quantize.imatrix.chunks_count i32 = 124 llama_model_loader: - kv 49: split.no u16 = 0 llama_model_loader: - kv 50: split.tensors.count i32 = 1025 llama_model_loader: - kv 51: split.count u16 = 0 llama_model_loader: - type f32: 361 tensors llama_model_loader: - type q4_K: 190 tensors llama_model_loader: - type q5_K: 116 tensors llama_model_loader: - type q6_K: 184 tensors llama_model_loader: - type iq2_xxs: 6 tensors llama_model_loader: - type iq1_s: 168 tensors llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect llm_load_vocab: special tokens cache size = 819 llm_load_vocab: token to piece cache size = 0.8223 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = deepseek2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 129280 llm_load_print_meta: n_merges = 127741 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 163840 llm_load_print_meta: n_embd = 7168 llm_load_print_meta: n_layer = 61 llm_load_print_meta: n_head = 128 llm_load_print_meta: n_head_kv = 128 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 192 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 24576 llm_load_print_meta: n_embd_v_gqa = 16384 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 18432 llm_load_print_meta: n_expert = 256 llm_load_print_meta: n_expert_used = 8 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = yarn llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 0.025 llm_load_print_meta: n_ctx_orig_yarn = 4096 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 671B llm_load_print_meta: model ftype = IQ1_S - 1.5625 bpw llm_load_print_meta: model params = 671.03 B llm_load_print_meta: model size = 130.60 GiB (1.67 BPW) llm_load_print_meta: general.name = DeepSeek R1 BF16 llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>' llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: EOT token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: PAD token = 128815 '<|PAD▁TOKEN|>' llm_load_print_meta: LF token = 131 'Ä' llm_load_print_meta: FIM PRE token = 128801 '<|fim▁begin|>' llm_load_print_meta: FIM SUF token = 128800 '<|fim▁hole|>' llm_load_print_meta: FIM MID token = 128802 '<|fim▁end|>' llm_load_print_meta: EOG token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_layer_dense_lead = 3 llm_load_print_meta: n_lora_q = 1536 llm_load_print_meta: n_lora_kv = 512 llm_load_print_meta: n_ff_exp = 2048 
llm_load_print_meta: n_expert_shared = 1 llm_load_print_meta: expert_weights_scale = 2.5 llm_load_print_meta: expert_weights_norm = 1 llm_load_print_meta: expert_gating_func = sigmoid llm_load_print_meta: rope_yarn_log_mul = 0.1000 time=2025-02-10T09:58:08.799+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.00 - " [GIN] 2025/02/10 - 09:58:08 | 500 | 5m0s | 127.0.0.1 | POST "/api/generate" time=2025-02-10T09:58:13.930+08:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.130783519 model=/nfs-userfs/xuyanzhen/deepseek/blobs/sha256-a542caee8df72af41ad48d75b94adacb5fbc61856930460bd599d835400fb3b6 time=2025-02-10T09:58:14.226+08:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.426999476 model=/nfs-userfs/xuyanzhen/deepseek/blobs/sha256-a542caee8df72af41ad48d75b94adacb5fbc61856930460bd599d835400fb3b6 time=2025-02-10T09:58:14.527+08:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.727853776 model=/nfs-userfs/xuyanzhen/deepseek/blobs/sha256-a542caee8df72af41ad48d75b94adacb5fbc61856930460bd599d835400fb3b6nvidia-smi gpu usage:
@thistlillo commented on GitHub (Mar 27, 2025):
I get this message "gpu VRAM usage didn't recover within timeout" with almost all the models available today. This one was generated while interacting with Qwen2.5
gpu VRAM usage didn't recover within timeout
@marksverdhei commented on GitHub (Mar 31, 2025):
I'm getting this issue exclusively with Gemma-3-27b. QWQ 32b works, gemma-3-12b works. 50GB vram. Pretty strange...
@radishlee commented on GitHub (Apr 3, 2025):
user@ubuntu:~$ systemctl status ollama
● ollama.service - Ollama Service
Loaded: loaded (/etc/systemd/system/ollama.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2025-04-02 13:10:43 CST; 20h ago
Main PID: 1152 (ollama)
Tasks: 30 (limit: 74767)
Memory: 16.6G
CPU: 8h 41min 34.743s
CGroup: /system.slice/ollama.service
├─ 1152 /usr/local/bin/ollama serve
└─2256926 /usr/local/bin/ollama runner --model /data/ollama/.ollama/models/blobs/sha256-60cfdbde0472c3b850493551288a152f0858a0d1974964d6925c2b908035db76 --ctx-size 16384 --batch-size 512 --n-gpu-layers 31 --threads 8 --parallel 4 -->
4月 03 10:01:44 ubuntu ollama[1152]: llama_kv_cache_init: CPU KV buffer size = 7680.00 MiB
4月 03 10:01:44 ubuntu ollama[1152]: llama_init_from_model: KV self size = 7680.00 MiB, K (f16): 3840.00 MiB, V (f16): 3840.00 MiB
4月 03 10:01:44 ubuntu ollama[1152]: llama_init_from_model: CPU output buffer size = 1.62 MiB
4月 03 10:01:44 ubuntu ollama[1152]: llama_init_from_model: CPU compute buffer size = 1088.01 MiB
4月 03 10:01:44 ubuntu ollama[1152]: llama_init_from_model: graph nodes = 966
4月 03 10:01:44 ubuntu ollama[1152]: llama_init_from_model: graph splits = 1
4月 03 10:01:44 ubuntu ollama[1152]: time=2025-04-03T10:01:44.733+08:00 level=INFO source=server.go:619 msg="llama runner started in 2.52 seconds"
4月 03 10:06:36 ubuntu ollama[1152]: time=2025-04-03T10:06:36.330+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.195915848 model=/data/ollama/.ollama/models/blobs/sha256-a9d3d622b517bcf8150341f0>
4月 03 10:06:36 ubuntu ollama[1152]: time=2025-04-03T10:06:36.536+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.40198618 model=/data/ollama/.ollama/models/blobs/sha256-a9d3d622b517bcf8150341f07>
4月 03 10:06:36 ubuntu ollama[1152]: time=2025-04-03T10:06:36.785+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.6509068540000005 model=/data/ollama/.ollama/models/blobs/sha256-a9d3d622b517bcf81>
user@ubuntu:~$ ollama --version
ollama version is 0.6.2
Ollama is running on a Jetson: CUDA 12.4.131, JetPack 6.2, cuDNN 1.0, Ubuntu 22.04.
The LLM is deepseek-llm:7b and the embedding model is quentinz/bge-large-zh-v1.5:latest.
@talsan74 commented on GitHub (Apr 3, 2025):
I had the same issue while running llama3.2-vision:90b. I freed up more than 50GB of storage, and it worked fine
@yorjaggy commented on GitHub (Jul 22, 2025):
Actually this saved my day. I double-checked the amount of free space with df -h, and after freeing up at least 200 GB (I'm running DeepSeek-V3-Q2_K_XS:latest) the model started working properly 🚀
@andtewfox commented on GitHub (Jul 22, 2025):
@pluberd commented on GitHub (Aug 4, 2025):
Same here. My version is 0.10.1 and I have 24/96 GB VRAM/RAM.
While most LLMs can be loaded, some don't work:
jobautomation/OpenEuroLLM-German:latest
gemma3n:e4b-it-q4_K_M
gemma3n:e4b-it-q8_0
@Oruli commented on GitHub (Sep 22, 2025):
I get this same issue on Ollama version 0.12.0; was it ever resolved?
It only happens after the model is unloaded when keep_alive is reached, so basically I can have one chat, come back 10 minutes later, and Ollama loads the model to CPU instead of GPU.