Mirror of https://github.com/ollama/ollama.git
[GH-ISSUE #5522] deepseek-coder-v2:236b - Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/usr/share/ollama/...path/to/blob #3449
Open · 14 comments
Originally created by @scouzi1966 on GitHub (Jul 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5522
What is the issue?
I've had this issue for a while with earlier versions of Ollama and with the latest, on an Intel SPR 8480+ and an RTX 4090. The num_gpu parameter has been removed from the Modelfile, so I can no longer reduce the number of layers sent to the GPU. It sends 10 and I can't test with 9, 8, etc. I can run all other models without any issue.
I have 24 GB of VRAM on my 4090 (nothing else loaded) and 320 GB of main memory, on Ubuntu 22.04 with Nvidia driver 550.54.14 and CUDA 12.4.
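For reference, num_gpu can still be supplied per request through the API's options field even when it is not set in the Modelfile; a minimal sketch assuming the standard /api/generate endpoint (the model name and the value 9 are placeholders for this scenario, not values from the report):

```
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder-v2:236b",
  "prompt": "hello",
  "options": { "num_gpu": 9 }
}'
```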
Jul 06 19:23:15 ubuntux ollama[169742]: [GIN] 2024/07/06 - 19:23:15 | 200 | 64.872µs | 127.0.0.1 | HEAD "/"
Jul 06 19:23:15 ubuntux ollama[169742]: [GIN] 2024/07/06 - 19:23:15 | 200 | 18.056618ms | 127.0.0.1 | POST "/api/show"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.282-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=61 layers.offload=10 layers.split="" memory.available="[23.3 GiB]" memory.required.full="134.5 GiB" memory.required.partial="22.1 GiB" memory.required.kv="9.4 GiB" memory.required.allocations="[22.1 GiB]" memory.weights.total="132.5 GiB" memory.weights.repeating="132.1 GiB" memory.weights.nonrepeating="410.2 MiB" memory.graph.full="642.0 MiB" memory.graph.partial="891.5 MiB"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.283-04:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama1660031732/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 10 --parallel 1 --port 43475"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.284-04:00 level=INFO source=sched.go:382 msg="loaded runners" count=1
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.284-04:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.284-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
Jul 06 19:23:15 ubuntux ollama[584320]: INFO [main] build info | build=1 commit="7c26775" tid="140507113369600" timestamp=1720308195
Jul 06 19:23:15 ubuntux ollama[584320]: INFO [main] system info | n_threads=56 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140507113369600" timestamp=1720308195 total_threads=112
Jul 06 19:23:15 ubuntux ollama[584320]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="111" port="43475" tid="140507113369600" timestamp=1720308195
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: loaded meta data with 39 key-value pairs and 959 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408 (version GGUF V3 (latest))
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 0: general.architecture str = deepseek2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 1: general.name str = DeepSeek-Coder-V2-Instruct
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 2: deepseek2.block_count u32 = 60
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 3: deepseek2.context_length u32 = 163840
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 4: deepseek2.embedding_length u32 = 5120
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 5: deepseek2.feed_forward_length u32 = 12288
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 6: deepseek2.attention.head_count u32 = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 7: deepseek2.attention.head_count_kv u32 = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 8: deepseek2.rope.freq_base f32 = 10000.000000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 9: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 10: deepseek2.expert_used_count u32 = 6
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 11: general.file_type u32 = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 12: deepseek2.leading_dense_block_count u32 = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 13: deepseek2.vocab_size u32 = 102400
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 14: deepseek2.attention.q_lora_rank u32 = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 15: deepseek2.attention.kv_lora_rank u32 = 512
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 16: deepseek2.attention.key_length u32 = 192
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 17: deepseek2.attention.value_length u32 = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 18: deepseek2.expert_feed_forward_length u32 = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 19: deepseek2.expert_count u32 = 160
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 20: deepseek2.expert_shared_count u32 = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 21: deepseek2.expert_weights_scale f32 = 16.000000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 22: deepseek2.rope.dimension_count u32 = 64
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 23: deepseek2.rope.scaling.type str = yarn
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 24: deepseek2.rope.scaling.factor f32 = 40.000000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 25: deepseek2.rope.scaling.original_context_length u32 = 4096
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 26: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 28: tokenizer.ggml.pre str = deepseek-llm
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,102400] = ["!", """, "#", "$", "%", "&", "'", ...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 100000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 100001
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 100001
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 37: tokenizer.chat_template str = {% if not add_generation_prompt is de...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 38: general.quantization_version u32 = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - type f32: 300 tensors
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - type q4_0: 658 tensors
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - type q6_K: 1 tensors
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.536-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_vocab: special tokens cache size = 2400
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_vocab: token to piece cache size = 0.6661 MB
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: format = GGUF V3 (latest)
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: arch = deepseek2
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: vocab type = BPE
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_vocab = 102400
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_merges = 99757
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ctx_train = 163840
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd = 5120
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_head = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_head_kv = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_layer = 60
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_rot = 64
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_head_k = 192
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_head_v = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_gqa = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_k_gqa = 24576
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_v_gqa = 16384
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_norm_eps = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_logit_scale = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ff = 12288
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_expert = 160
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_expert_used = 6
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: causal attn = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: pooling type = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope type = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope scaling = yarn
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: freq_base_train = 10000.0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: freq_scale_train = 0.025
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ctx_orig_yarn = 4096
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope_finetuned = unknown
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_d_conv = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_d_inner = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_d_state = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_dt_rank = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model type = 236B
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model ftype = Q4_0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model params = 235.74 B
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model size = 123.78 GiB (4.51 BPW)
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: general.name = DeepSeek-Coder-V2-Instruct
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: BOS token = 100000 '<|begin▁of▁sentence|>'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: EOS token = 100001 '<|end▁of▁sentence|>'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: PAD token = 100001 '<|end▁of▁sentence|>'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: LF token = 126 'Ä'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_layer_dense_lead = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_lora_q = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_lora_kv = 512
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ff_exp = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_expert_shared = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: expert_weights_scale = 16.0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope_yarn_log_mul = 0.1000
Jul 06 19:23:15 ubuntux ollama[169742]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
Jul 06 19:23:15 ubuntux ollama[169742]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Jul 06 19:23:15 ubuntux ollama[169742]: ggml_cuda_init: found 1 CUDA devices:
Jul 06 19:23:15 ubuntux ollama[169742]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_tensors: ggml ctx size = 0.87 MiB
Jul 06 19:23:16 ubuntux ollama[169742]: time=2024-07-06T19:23:16.992-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
Jul 06 19:23:19 ubuntux ollama[169742]: time=2024-07-06T19:23:19.191-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: offloading 10 repeating layers to GPU
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: offloaded 10/61 layers to GPU
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: CPU buffer size = 105416.00 MiB
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: CUDA0 buffer size = 21335.35 MiB
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: n_ctx = 2048
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: n_batch = 512
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: n_ubatch = 512
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: flash_attn = 0
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: freq_base = 10000.0
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: freq_scale = 0.025
Jul 06 19:23:21 ubuntux ollama[169742]: time=2024-07-06T19:23:21.904-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
Jul 06 19:23:23 ubuntux ollama[169742]: time=2024-07-06T19:23:23.601-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
Jul 06 19:23:24 ubuntux ollama[169742]: llama_kv_cache_init: CUDA_Host KV buffer size = 8000.00 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: llama_kv_cache_init: CUDA0 KV buffer size = 1600.00 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: llama_new_context_with_model: KV self size = 9600.00 MiB, K (f16): 5760.00 MiB, V (f16): 3840.00 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: llama_new_context_with_model: CUDA_Host output buffer size = 0.41 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 842.00 MiB on device 0: cudaMalloc failed: out of memory
Jul 06 19:23:24 ubuntux ollama[169742]: ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 882903040
Jul 06 19:23:24 ubuntux ollama[169742]: llama_new_context_with_model: failed to allocate compute buffers
Jul 06 19:23:25 ubuntux ollama[169742]: llama_init_from_gpt_params: error: failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408'
Jul 06 19:23:26 ubuntux ollama[169742]: time=2024-07-06T19:23:26.314-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
Jul 06 19:23:26 ubuntux ollama[584320]: ERROR [load_model] unable to load model | model="/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408" tid="140507113369600" timestamp=1720308206
Jul 06 19:23:26 ubuntux ollama[169742]: terminate called without an active exception
Jul 06 19:23:26 ubuntux ollama[169742]: time=2024-07-06T19:23:26.566-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
Jul 06 19:23:26 ubuntux ollama[169742]: time=2024-07-06T19:23:26.817-04:00 level=ERROR source=sched.go:388 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408'"
Jul 06 19:23:26 ubuntux ollama[169742]: [GIN] 2024/07/06 - 19:23:26 | 500 | 11.766669885s | 127.0.0.1 | POST "/api/chat"
Jul 06 19:23:31 ubuntux ollama[169742]: time=2024-07-06T19:23:31.944-04:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.127181114 model=/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408
Jul 06 19:23:32 ubuntux ollama[169742]: time=2024-07-06T19:23:32.194-04:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.376988446 model=/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408
Jul 06 19:23:32 ubuntux ollama[169742]: time=2024-07-06T19:23:32.444-04:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.626674402 model=/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408
OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: 0.1.48
@olumolu commented on GitHub (Jul 8, 2024):
I have tried the 16b on Alma Linux with a Xeon processor on one motherboard and 16 GB of main memory; I could run deepseek-v2 16b nicely.
@scouzi1966 commented on GitHub (Jul 8, 2024):
My issue is with the 236b model. Quite a large difference from the 16b.
@Ramzee-S commented on GitHub (Jul 13, 2024):
Sorry, not of much help, but I have a similar issue. When I disable my GPUs (2x RTX 3090) I can run the model in main memory (516 GB, 16 channels x 32 GB) and it runs at a tolerable speed (2x Xeon 8470), although initial prompt processing takes a while. However, when the GPUs are enabled, I get the error.
ollama run deepseek-coder-v2:236b
Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408'
Other models that won't fit in VRAM run fine partially in VRAM and partially in RAM, but this one does not seem to do it.
Ollama version is 0.1.47, and there is enough local disk space too.
Any help would be appreciated.
@scouzi1966 commented on GitHub (Jul 15, 2024):
How do you get Ollama to ignore your GPU? Or how do you disable it on Linux?
@Ramzee-S commented on GitHub (Jul 15, 2024):
By accident my Ubuntu Nvidia driver got updated a few days ago, and I needed a reboot to get the Nvidia drivers working again. When I ran nvidia-smi, nothing was showing up except a message that the versions did not match.
In that state I tried the deepseek-coder-v2 model and it ran! It was actually quite good. After the reboot it stopped working. I replicated this by removing the physical GPUs, and then the model works; putting the GPUs back, it of course did not work again. I have not yet found a way to temporarily disable the GPU via environment variables, so I have no easy fix for this and am using other models until maybe a new version of Ollama will run this model.
Edit: OK, found a fix from here.
You can disable the GPUs on Linux (works on Ubuntu 22): first determine the device's PCI id with nvidia-smi, then to deactivate it use that id in place of the '0000:xx:00.0' string in the nvidia-smi drain command; to activate it again use:
nvidia-smi drain -p 0000:xx:00.0 -m 0
again replacing the id.
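A hedged sketch of that full workflow; the PCI address is a placeholder, drain mode generally requires root and an idle GPU, and the -m 1 form for deactivation is an assumption mirroring the -m 0 activation command above:

```
# Find each GPU's PCI bus id (replace the placeholder address below with yours).
nvidia-smi --query-gpu=index,pci.bus_id --format=csv
# Deactivate (drain) the GPU -- assumed form, the inverse of the -m 0 command.
sudo nvidia-smi drain -p 0000:xx:00.0 -m 1
# Reactivate it later.
sudo nvidia-smi drain -p 0000:xx:00.0 -m 0
```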
Now deepseek is working again for me, but of course partial GPU offloading would be nicer. Just as an example:
total duration: 3m24.609638164s
load duration: 46.016846ms
prompt eval count: 424 token(s)
prompt eval duration: 26.198928s
prompt eval rate: 16.18 tokens/s
eval count: 525 token(s)
eval duration: 2m58.207292s
eval rate: 2.95 tokens/s
@dhiltgen commented on GitHub (Jul 24, 2024):
You can use OLLAMA_LLM_LIBRARY to force a CPU-based runner (e.g. OLLAMA_LLM_LIBRARY=cpu_avx2). I've also posted PR #5922 to add a new GPU overhead setting to bring back a viable workaround for when the memory predictions are incorrect.
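On a Linux systemd install, these would typically be applied as service environment overrides; a rough sketch, where the overhead value (in bytes) is purely illustrative and not a recommendation:

```
sudo systemctl edit ollama.service
# In the override file, under [Service]:
#   Environment="OLLAMA_LLM_LIBRARY=cpu_avx2"
#   Environment="OLLAMA_GPU_OVERHEAD=2147483648"   # illustrative: reserve ~2 GiB of VRAM
sudo systemctl restart ollama
```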
@gsoul commented on GitHub (Aug 27, 2024):
I'm experiencing the same issue as the topic starter. Is there a chance someone has figured out a workaround by now?
@pftg commented on GitHub (Sep 10, 2024):
The same issue for macOS
@gsoul commented on GitHub (Sep 11, 2024):
It's fixed for me now. @pftg try to upgrade to the latest version and play around with OLLAMA_GPU_OVERHEAD env parameter: https://github.com/ollama/ollama/pull/5922
@pftg commented on GitHub (Sep 11, 2024):
@gsoul I tried OLLAMA_GPU_OVERHEAD for deepseek-v2:236b and still haven't found success. I'm using the default version for now. I believe my laptop has insufficient memory for a larger version.
@olumolu commented on GitHub (Sep 11, 2024):
How much RAM do you have?
@pftg commented on GitHub (Sep 11, 2024):
@olumolu 16GB
@olumolu commented on GitHub (Sep 11, 2024):
No, with 16 GB you can't even run gemma 27b.
With a minimum of 128 GB of RAM, on Arch without a desktop environment and with zswap or VRAM, you can barely run that model. 160-196 GB of RAM is recommended to run that model.
@Ramzee-S commented on GitHub (Sep 11, 2024):
I am quite sure that with 16 GB of RAM or VRAM you won't be able to run deepseek 236b in any productive way (the model size in RAM is 133 GB, and a big, fast SSD swap will also just be too slow). The issues above also occurred with 512 GB of RAM and 2x RTX 3090 with 48 GB of total VRAM, and for users with 196 GB of RAM.
The issues above also seem related to this thread on llama.cpp.
https://github.com/ggerganov/llama.cpp/discussions/8520
Running deepseek 236b in llama.cpp directly also gave issues, and I think these were related to some of the Ollama issues people are experiencing. When running llama.cpp, by default a fixed fraction of the model size was used as a context-size multiplier, which resulted in very high memory allocation when loading/starting the model and failed in some cases. Basically, you could have enough memory for the model but not for the default allocated context size. If a smaller context window was manually allocated, then things worked with llama.cpp. I would not be surprised if setting lower context defaults in Ollama fixes some of these issues. However, the other symptom ("things working without the GPU, and not with the GPU") seems to be a separate, different issue.
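As a hedged illustration of that suggestion, the context window can be lowered per request through the same options mechanism shown earlier (the num_ctx value is only an example, not a recommendation):

```
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-coder-v2:236b",
  "messages": [{ "role": "user", "content": "hello" }],
  "options": { "num_ctx": 2048 }
}'
```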