Originally created by @orlyandico on GitHub (Jan 1, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1756
Originally assigned to: @dhiltgen on GitHub.
I recently put together an (old) physical machine with an Nvidia K80, which is only supported up to CUDA 11.4 and Nvidia driver 470. All my previous experiments with Ollama were with more modern GPUs.
I found that Ollama doesn't use the GPU at all. I cannot find any documentation on the minimum required CUDA version, or on whether it is possible to run on older CUDA versions (cards like the Nvidia K80 and V100 are still present in the cloud, e.g. the G2 and P2 instance families on AWS, and there are lots of K80s all over eBay).
EDIT: looking through the logs, it appears that the GPUs are being detected:
Jan 1 20:22:43 thinkstation-s30 ollama[911]: 2024/01/01 20:22:43 llama.go:300: 24762 MB VRAM available, loading up to 162 GPU layers
Jan 1 20:22:43 thinkstation-s30 ollama[911]: 2024/01/01 20:22:43 llama.go:436: starting llama runner
Jan 1 20:22:43 thinkstation-s30 ollama[911]: 2024/01/01 20:22:43 llama.go:494: waiting for llama runner to start responding
Jan 1 20:22:43 thinkstation-s30 ollama[911]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
Jan 1 20:22:43 thinkstation-s30 ollama[911]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
Jan 1 20:22:43 thinkstation-s30 ollama[911]: ggml_init_cublas: found 3 CUDA devices:
Jan 1 20:22:43 thinkstation-s30 ollama[911]: Device 0: Tesla K80, compute capability 3.7
Jan 1 20:22:43 thinkstation-s30 ollama[911]: Device 1: Tesla K80, compute capability 3.7
Jan 1 20:22:43 thinkstation-s30 ollama[911]: Device 2: NVIDIA GeForce GT 730, compute capability 3.5
and
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: ggml ctx size = 0.11 MiB
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: using CUDA for GPU acceleration
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: mem required = 70.46 MiB
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: offloading 32 repeating layers to GPU
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: offloaded 33/33 layers to GPU
Jan 1 20:34:20 thinkstation-s30 ollama[911]: llm_load_tensors: VRAM used: 3577.61 MiB
but....
Jan 1 20:34:21 thinkstation-s30 ollama[911]: CUDA error 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: no kernel image is available for execution on the device
Jan 1 20:34:21 thinkstation-s30 ollama[911]: current device: 0
Jan 1 20:34:21 thinkstation-s30 ollama[911]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: !"CUDA error"
Jan 1 20:34:22 thinkstation-s30 ollama[911]: 2024/01/01 20:34:22 llama.go:451: 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: no kernel image is available for execution on the device
Jan 1 20:34:22 thinkstation-s30 ollama[911]: current device: 0
Jan 1 20:34:22 thinkstation-s30 ollama[911]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:7801: !"CUDA error"
Jan 1 20:34:22 thinkstation-s30 ollama[911]: 2024/01/01 20:34:22 llama.go:459: error starting llama runner: llama runner process has terminated
@Cybervet commented on GitHub (Jan 2, 2024):
What is your Linux kernel? I think 6.x kernels don't support a lot of older Nvidia cards.
@orlyandico commented on GitHub (Jan 2, 2024):
The kernel is 6+ and the setup is supported. I was able to get PyTorch working with CUDA - albeit only PyTorch 2.0.1, since that is the last version that supports CUDA 11.4.
The error 209 "no kernel image is available for execution on the device" is for CUDA, not the Linux kernel. Basically the Ollama distribution doesn't have a compiled kernel (via nvcc) for CUDA 11.4 (not even sure if that is supported, if I build from source).
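(For illustration: the architecture targeting happens when the CUDA kernels are compiled. A minimal sketch of the relevant nvcc flags, mirroring the --generate-code syntax visible in the build logs quoted later in this thread; the file names are placeholders:
$ nvcc -O3 \
    --generate-code=arch=compute_35,code=[compute_35,sm_35] \
    --generate-code=arch=compute_37,code=[compute_37,sm_37] \
    -c ggml-cuda.cu -o ggml-cuda.o
If the shipped binary contains no entry matching the card's compute capability, the driver fails at load time with CUDA error 209, which is exactly the "no kernel image is available" message above.)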
@yolobnb commented on GitHub (Jan 8, 2024):
This is also the case for me. I am using a Quadro K2200. It is recognized, along with its compute capability, but as soon as I pull a model, the error shows up and Ollama terminates.
@dhiltgen commented on GitHub (Jan 9, 2024):
The K80 is Compute Capability 3.7, which at present isn't supported by our CUDA builds. (see https://developer.nvidia.com/cuda-gpus for the mapping table)
Based on our current build setup, Compute Capability 6.0 is the minimum we'll support. We had some bugs in the detection and fallback logic in 0.1.18, which should be resolved in 0.1.19, so that if we detect anything older than 6.0 we'll fall back to CPU.
There's a possibility we may be able to support 5.x cards by compiling llama.cpp with different flags and dynamically loading the right library variant on the fly based on what we discover, but that support hasn't been merged yet.
I'm not sure yet if we can compile support going all the way back into the 3.7 series, but we'll keep this ticket tracking that.
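(As an aside, to check what compute capability a card reports, recent nvidia-smi versions support a direct query; older drivers may lack the compute_cap field, in which case the mapping table linked above is the reference:
$ nvidia-smi --query-gpu=name,compute_cap --format=csv
)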
@datag commented on GitHub (Jan 10, 2024):
I'd love to see that change. Owner of an old GeForce GTX 960M on amd64 Linux here. Version 0.1.18 stopped working, while 0.1.17 had been working.
@dhiltgen commented on GitHub (Jan 10, 2024):
Can you clarify? Was 0.1.17 working on the GPU, or falling back to CPU mode?
Also to clarify, the GTX 960M is a Compute Capability 5.0 card, which we're tracking in a different ticket now #1865
@datag commented on GitHub (Jan 10, 2024):
You're right, I guess it was falling back to CPU mode, but I'm unsure how to read the logs correctly.
The issue you mentioned seems to be the issue I was having. Version 0.1.19 fixes it. Sorry for the noise and thanks!
@dhiltgen commented on GitHub (Jan 10, 2024):
At startup the server log will report information about attempting to discover GPU information and, in the case of CUDA cards, will report the compute capability. If we don't detect a supported GPU, we report that we're falling back to CPU mode. In the near future we'll be adding refinements to support multiple variants for a given GPU (and CPU), to leverage modern capabilities when detected while still being able to fall back to a baseline that works for older GPUs/CPUs.
@nejib1 commented on GitHub (Jan 15, 2024):
Hello, same case here. I have an Nvidia K80; Ollama works only on the CPU :(
@sunzh231 commented on GitHub (Jan 16, 2024):
Hi, same case here. I have an Nvidia M40; Ollama works only on the CPU in a Docker container :(
@dhiltgen commented on GitHub (Jan 18, 2024):
The M40 is a Compute Capability 5.2 card, so it's covered by #1865
@dhiltgen commented on GitHub (Jan 20, 2024):
We're using CUDA v11 to compile our official builds. Digging around a bit, it looks like CUDA v11 no longer supports Compute Capability 3.0, but I am able to get nvcc to target 3.5 cards.
I'll work on some modifications to the way we do our builds so that someone with a 3.0 card and an older CUDA toolkit might be able to build it on their own from source, but I think we may be able to get 3.5+ support into the official builds.
@orlyandico commented on GitHub (Jan 20, 2024):
The K80 I referenced in my original post supports up to CUDA 11.4, which is the last version it will ever support, since it has been end-of-lifed.
@dhiltgen commented on GitHub (Jan 20, 2024):
PR #2116 lays the foundation for experimenting with CC 3.5 support. I'm not sure if we'll need other flags to get it working, or if simply adding "35" to the list of CMAKE_CUDA_ARCHITECTURES will do it.
@orlyandico commented on GitHub (Jan 25, 2024):
EDIT: I am aware that there are "resizable BAR" issues around the use of the Tesla P40, and my hardware is so ancient that it does not support resizable BAR. However, PyTorch runs just fine and I can load e.g. BigBird into the P40 and do inference. Note that my PyTorch install is 2.0.1 and also worked on the K80. PyTorch itself warns that the GT730 (CC 3.5) is not supported, and CC 3.7 is the lowest supported on 2.0.1 (which is a few years old at this point).
I replaced the K80 with a P40, which is a Compute Capability 6.1 card. The card appears in nvidia-smi and is detected in the Ollama logs:
...
Jan 25 15:26:21 thinkstation-s30 ollama[919]: ggml_init_cublas: found 2 CUDA devices:
Jan 25 15:26:21 thinkstation-s30 ollama[919]: Device 0: Tesla P40, compute capability 6.1
Jan 25 15:26:21 thinkstation-s30 ollama[919]: Device 1: NVIDIA GeForce GT 730, compute capability 3.5
...
However, I still get the "no kernel image" error, and it appears to be using Device 1! It's not clear how to force the use of Device 0 (when I was using the K80 it was being selected properly). I tried the CUDA_VISIBLE_DEVICES environment variable, which had no effect.
...
Jan 25 15:26:26 thinkstation-s30 ollama[919]: llama_new_context_with_model: total VRAM used: 2258.20 MiB (model: 1456.19 MiB, context: 802.00 MiB)
Jan 25 15:26:26 thinkstation-s30 ollama[919]: CUDA error 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: no kernel image is available for execution on the device
Jan 25 15:26:26 thinkstation-s30 ollama[919]: current device: 1
Jan 25 15:26:26 thinkstation-s30 ollama[919]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: !"CUDA error"
Jan 25 15:26:27 thinkstation-s30 ollama[919]: 2024/01/25 15:26:27 llama.go:451: 209 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: no kernel image is available for execution on the device
Jan 25 15:26:27 thinkstation-s30 ollama[919]: current device: 1
Jan 25 15:26:27 thinkstation-s30 ollama[919]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8075: !"CUDA error"
...
@dhiltgen commented on GitHub (Jan 25, 2024):
@orlyandico that's unfortunate that CUDA_VISIBLE_DEVICES didn't do the trick. I'll see if I can set up a test rig similar to your setup and try to find a way to ignore the unsupported card.
@orlyandico commented on GitHub (Jan 25, 2024):
I've also gotten DiffusionPipeline and models from HuggingFace working. It is a bit odd that torch.cuda.device_count() sometimes returns 1 (and only enumerates the P40) and sometimes 2 (also enumerating the GT730).
@dhiltgen commented on GitHub (Jan 27, 2024):
I've got a PR up to add support, but I'm a little concerned people might actually see a performance hit, not an improvement, by transitioning from CPU to GPU on these old cards.
Folks with these old cards: if you want to give the change a try, build from source and let me know how the performance compares before/after. That would be helpful for weighing when/if we merge the PR.
@felipecock commented on GitHub (Jan 28, 2024):
Hi @dhiltgen
I have a GeForce 920M GPU, which has CC 3.5.
I'd like to participate in that test; please guide me on how I could compile it on Ubuntu 22.04 and how I can benchmark the test with and without the GPU.
I appreciate your contributions and your efforts to support these older GPUs.
@dhiltgen commented on GitHub (Jan 28, 2024):
Thanks @felipecock
Check out https://github.com/ollama/ollama/blob/main/docs/development.md for instructions, and if you get stuck, join the community on Discord for an added hand.
@nejib1 commented on GitHub (Jan 28, 2024):
Hello @dhiltgen
Is there any possibility of getting Ollama to work with the Nvidia K80 in the next few days, or should we abandon this idea?
@dhiltgen commented on GitHub (Jan 29, 2024):
@nejib1 if you apply the changes of my PR as a patch to the repo and build from source, it will run on a K80 GPU. Instructions on building from source are here
Given the concerns we have that this might actually result in a performance regression, not an improvement, for users, we're going to hold off on merging this until we get more performance data.
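(For anyone unsure how to apply a PR as a patch: GitHub serves any pull request in patch form at its .patch URL, so a generic sketch, with the PR number left as a placeholder, looks like:
$ git clone https://github.com/ollama/ollama.git
$ cd ollama
$ curl -L https://github.com/ollama/ollama/pull/<PR-NUMBER>.patch | git am
)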
@nejib1 commented on GitHub (Jan 29, 2024):
Thank you very much, I'll try it
@tbendien commented on GitHub (Jan 29, 2024):
I am having similar issues trying to run Ollama Web UI with my RTX A4000 16GB GPU.
When I run standard Ollama, it uses my GPU just fine. When I install Ollama Web UI, I get errors (from a full clean Ubuntu install, with all NVIDIA drivers and container toolkit installed).
Ollama Web UI commands
gtadmin@gtaiws3:~/ollama-webui$ docker-compose -f docker-compose.yaml -f docker-compose.gpu.yaml up
Traceback (most recent call last):
File "/usr/bin/docker-compose", line 33, in
sys.exit(load_entry_point('docker-compose==1.29.2', 'console_scripts', 'docker-compose')())
File "/usr/lib/python3/dist-packages/compose/cli/main.py", line 81, in main
command_func()
File "/usr/lib/python3/dist-packages/compose/cli/main.py", line 200, in perform_command
project = project_from_options('.', options)
File "/usr/lib/python3/dist-packages/compose/cli/command.py", line 60, in project_from_options
return get_project(
File "/usr/lib/python3/dist-packages/compose/cli/command.py", line 157, in get_project
return Project.from_config(
File "/usr/lib/python3/dist-packages/compose/project.py", line 135, in from_config
service_dict['device_requests'] = project.get_device_requests(service_dict)
File "/usr/lib/python3/dist-packages/compose/project.py", line 375, in get_device_requests
raise ConfigurationError(
TypeError: ConfigurationError.__init__() takes 2 positional arguments but 3 were given
When I just run the CPU-only yaml, everything works fine...
gtadmin@gtaiws3:~/ollama-webui$ docker-compose -f docker-compose.yaml up
ollama is up-to-date
ollama-webui is up-to-date
Attaching to ollama, ollama-webui
ollama | Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
ollama | Your new public key is:
ollama |
ollama | ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEi4k2WvzJB4+o3PMQTvhq1M2ci6JnEYfDUiH6Dl6k+k
ollama |
ollama | 2024/01/29 02:09:40 images.go:857: INFO total blobs: 0
ollama | 2024/01/29 02:09:40 images.go:864: INFO total unused blobs removed: 0
ollama | 2024/01/29 02:09:40 routes.go:950: INFO Listening on [::]:11434 (version 0.1.22)
ollama | 2024/01/29 02:09:40 payload_common.go:106: INFO Extracting dynamic libraries...
ollama | 2024/01/29 02:09:42 payload_common.go:145: INFO Dynamic LLM libraries [cpu cuda_v11 cpu_avx rocm_v5 rocm_v6 cpu_avx2]
ollama | 2024/01/29 02:09:42 gpu.go:94: INFO Detecting GPU type
ollama | 2024/01/29 02:09:42 gpu.go:236: INFO Searching for GPU management library libnvidia-ml.so
ollama | 2024/01/29 02:09:42 gpu.go:282: INFO Discovered GPU libraries: []
ollama | 2024/01/29 02:09:42 gpu.go:236: INFO Searching for GPU management library librocm_smi64.so
ollama | 2024/01/29 02:09:42 gpu.go:282: INFO Discovered GPU libraries: []
ollama | 2024/01/29 02:09:42 cpu_common.go:11: INFO CPU has AVX2
ollama | 2024/01/29 02:09:42 routes.go:973: INFO no GPU detected
ollama-webui | start.sh: 3: Bad substitution
ollama-webui | INFO: Started server process [1]
ollama-webui | INFO: Waiting for application startup.
ollama-webui | INFO: Application startup complete.
ollama-webui | INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
@dhiltgen commented on GitHub (Jan 29, 2024):
@tbendien an RTX A4000 is a modern GPU with Compute Capability 8.6. Let's keep this ticket focused on support for much older cards with CC 3.5 and 3.7. Folks can help troubleshoot on Discord, or you can open a new issue.
@felipecock commented on GitHub (Jan 31, 2024):
@dhiltgen
I've performed a test with and without the GPU:
CPU only, Intel Core i7-5500U - ollama:main branch (durations appear to be in ns):
{"model":"llama2:latest","created_at":"2024-01-31T22:24:33.848173925Z","message":{"role":"assistant","content":""},"done":true,"total_duration":330940056957,"load_duration":3067744651,"prompt_eval_count":457,"prompt_eval_duration":227370727000,"eval_count":157,"eval_duration":100501014000}
GPU, GeForce 920M @ 4GB (it only reached about 33% GPU during the first minute, and dedicated memory doesn't seem to be used) + CPU Intel Core i7-5500U (at 100% most of the time) - ollama:cc_3.5 branch:
llama_print_timings: load time = 2001.26 ms
llama_print_timings: sample time = 168.67 ms / 175 runs ( 0.96 ms per token, 1037.54 tokens per second)
llama_print_timings: prompt eval time = 110295.28 ms / 154 tokens ( 716.20 ms per token, 1.40 tokens per second)
llama_print_timings: eval time = 198530.10 ms / 174 runs ( 1140.98 ms per token, 0.88 tokens per second)
llama_print_timings: total time = 309092.12 ms
It was a bit faster with the GPU, although the GPU was not used at 100% as I expected. I don't know if that is normal for this model.
@orlyandico commented on GitHub (Feb 1, 2024):
Found the reason: ollama.service was launched from systemd and so wasn't picking up CUDA_VISIBLE_DEVICES from the environment.
That still leaves the question of why the CC 3.5 device was being selected when it isn't the first device and is not supported. Ollama probably should have logic to select only the supported CUDA devices on a multi-device host.
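(For others who hit this: the standard way to hand an environment variable to a systemd service is a drop-in override. A minimal sketch, with the device index adjusted to your setup:
$ sudo systemctl edit ollama.service
# then add in the editor:
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=0"
$ sudo systemctl restart ollama
)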
@dhiltgen commented on GitHub (Feb 1, 2024):
@orlyandico we don't yet have logic to automatically detect and bypass unsupported cards in a multi-gpu setup when one isn't supported but others are.
@felipecock can you clarify your scenario? Are you attempting to load a model that can't fit entirely in VRAM and thus are getting a split between CPU/GPU? For apples-to-apples performance comparison, I'd try to get metrics from a model that fits entirely in the GPU so we're not getting thrown off by I/O bottlenecks or GPU stalling waiting for CPU.
@felipecock commented on GitHub (Feb 2, 2024):
@dhiltgen, I've performed a test on a newer machine (13th Gen Intel(R) Core(TM) i9-13900H, 2600 MHz, 14 cores, 20 logical processors, 64GB RAM + NVIDIA RTX 2000 Ada Generation Laptop GPU) and I realized that the CPU is used more extensively than the GPU, despite Ollama saying the GPU would be used.
So I think the expected behavior is to use the CPU for some part of the process that cannot be parallelized (I believe), which results in heavier CPU usage than GPU usage.
I'm not an expert in this, then I could be wrong. 😕
@dhiltgen commented on GitHub (Feb 2, 2024):
@felipecock I'm not quite sure what your question is. It looks like that GPU has 12G of VRAM, so you'll be able to run larger models entirely on the GPU than a typical CC 3.5 or 3.7 card. We're drifting a bit off-topic for this issue, but if the model doesn't fit in VRAM, then some amount of processing is done on the CPU, and often this can result in poor performance as the GPU stalls waiting for the CPU to keep up.
The current state of this issue is I have a PR up which would enable support for these older cards, but we're not sure if we're going to merge it yet or not, as we're concerned it could be a performance hit for many users given these older cards aren't particularly well suited for LLM work.
@orlyandico commented on GitHub (Feb 2, 2024):
I gave up on the K80 and got a P40 because, even though the K80 is 2 x 12GB, it doesn't support smaller data types! You're stuck with fp32 and fp64, so even a 7B model won't fit in the 12GB of RAM!
@felipecock commented on GitHub (Feb 3, 2024):
Thank you for your reply. That super laptop is not mine; my laptop is the Intel 5500 with the GeForce 920M that has CC 3.5.
I just did the test on that "super" laptop to validate whether, even with a "supported" CUDA GPU, some high-demand processing is performed on the CPU, and it is.
I've published my results, but if you want to guide me to perform a better test, or point out some specific scenarios to test, I'll be happy to do so.
Thank you again for your time and kindness!
@orlyandico commented on GitHub (Feb 3, 2024):
@felipecock if the model doesn't fit entirely in the GPU RAM, then only some of the layers are stored and evaluated on the GPU, the rest on the CPU. So even if some layers evaluate faster on the GPU, the inference stalls waiting for the CPU, and you'll get barely-better-than-CPU performance.
The GeForce 920M only has 2GB of RAM, so only the tiniest of models would fit entirely on it. Actually, NONE of the existing models in the model library would fit entirely.
I just tried "dolphin-phi" which is 1.6GB but it consumes 2.5GB of GPU memory on my machine.
I did a quick search on HuggingFace for a sub-1GB GGUF model that can be imported, but found nothing.
Even if the entire model fits in the GPU, there is still some activity on the CPU. I just tested a specific LLM on Ollama and while it is generating, the CPU usage is 100% and the GPU usage is 100% - but my machine has 6 cores, so only 1 CPU core is being used. I believe this is copying data to and from the GPU memory. It is not actually doing inference on the CPU.
I have observed that when doing pure CPU inference, ALL of the CPU cores are used (e.g. I see 600% CPU). So in a mixed setup where some layers are on the GPU and others on the CPU, if Ollama uses all available CPU cores (it does not when all layers are on the GPU), there might still be some benefit to offloading some layers to the GPU.
@felipecock commented on GitHub (Feb 7, 2024):
Thank you, @orlyandico, for your reply.
@j-d-salinger commented on GitHub (Feb 29, 2024):
I'm willing to test the K80... I have 2 of them on one machine (= 4 GPUs total, since the K80 is two 12GB GPUs stuck together). I have 128GB of RAM on the machine.
Trying it on the mistral, llama2, and dolphin-phi models... I've successfully merged your patch and built it, but I can't quite tell if it is using the GPU? At least nvidia-smi doesn't show that it's using it. It does detect it in the logs: INFO: Cuda Compute 3.7 detected. But the CPU is at 1,500% in top...
Correct me if I'm wrong, but even "run"-ing the model can benefit from the GPU? Last time I worked with AI, I only ran (evaluated) models on the CPU, and would only use the GPU for training. Right now I'm only testing it with "run" (chat) and it is showing CPU-only.
@nejib1 commented on GitHub (Feb 29, 2024):
This Linux command refreshes nvidia-smi continuously; you can run something and check the GPU %:
watch -n 1 nvidia-smi
@j-d-salinger commented on GitHub (Feb 29, 2024):
Yes, it says "No running process found". All are at 0% utilization, 0 MB of memory used
@orlyandico commented on GitHub (Feb 29, 2024):
If CPU usage is 1500% and there is no GPU memory usage, then it's not running on the GPU.
@j-d-salinger commented on GitHub (Feb 29, 2024):
Yes, how do I fix that?
@orlyandico commented on GitHub (Feb 29, 2024):
If you could post the logs here (e.g. similar to my original post at the very top) that would be useful.
@j-d-salinger commented on GitHub (Mar 1, 2024):
Here is some startup info, before chatting:
@j-d-salinger commented on GitHub (Mar 1, 2024):
Here are some logs from when I try to chat. Let me know if you need more; I had to download the model in between.
@j-d-salinger commented on GitHub (Mar 2, 2024):
That issue may have been caused by pulling the repo from master and building from that, rather than checking out the latest release (v0.1.27 as of today), applying the patch, and testing on the K80.
Unfortunately I only figured this out once I had removed the K80s in favor of a P40. If I have time I'll go back and test the K80s with the latest release.
@dhiltgen commented on GitHub (Mar 25, 2024):
I'm not sure when I'll have a chance to get back to this one, so this would be a great community contribution if someone's up for it.
The rough design is to modify the Linux and Windows gen_* scripts here so that, by setting some env var before calling go generate ./..., we'd add the CMAKE_CUDA_ARCHITECTURES entries for 35 and 37. Then we'd need to refactor the CudaComputeMin in gpu.go so that it's easy to override at build time. (Look at how we set the version.go setting in the build script.) Then document it all so it's easy for folks with these older cards to install an older CUDA version that still supports 3.5 and build from source. It might look something like this:
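(A sketch of the shape this eventually took, using the architecture list and linker overrides from the build commands shared later in this thread; treat the exact values as illustrative, not as the PR's final form:
$ export PATH=$PATH:/usr/local/cuda-11.4/bin
$ make -j 8 CUDA_ARCHITECTURES="35;37;50;52" EXTRA_GOLDFLAGS="-X=github.com/ollama/ollama/discover.CudaComputeMajorMin=3 -X=github.com/ollama/ollama/discover.CudaComputeMinorMin=5"
)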
@langstonmeister commented on GitHub (Aug 4, 2024):
I just built the files, like here:
and it worked, but I am still not getting any action on the GPU, just from CPU. I am going to try again from the beginning with the latest from the git and I'll report back.
I am trying to build for a k40. I have 2 that I would like to use, but so far no luck.
edit: I recompiled it all from the main source, and I'm getting the same errors, even though CUDA 3.5 was supposed to be supported.
@langstonmeister commented on GitHub (Aug 5, 2024):
Okay, I was able to get something to work! I'm not there yet, but hopefully someone smarter than me can fill in some gaps.
Turns out that the CMAKE_CUDA_ARCHITECTURES is still passing the compute versions to the compiler. What I was able to get working was this command:
I'm still having the issue where it tells me that my GPU is too old, but it is showing that there is about 64MB in the vram.
One annoying thing is that I keep having to install the cuda-toolkit each time I want to compile, but then reinstall the utils-470 when I want to try running it. It would seem that this generation did not have nvidia-smi available along with the cuda-toolkit. I could see users being frustrated by this.
@dhiltgen commented on GitHub (Aug 6, 2024):
@langstonmeister check out #2233 for some minor changes required to get things working on CC 3.5 and 3.7 GPUs.
@orlyandico commented on GitHub (Aug 6, 2024):
Possibly an extremely dumb question/observation; sorry if I missed something.
I saw an error on the GitHub page:
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.183.06: nvml vram init failure: 9
That seems to say that a version later than the Nvidia 470 driver, and a matching CUDA toolkit, have been installed. As far as I can recall, CUDA Compute 3.5/3.7 cards only support the 470 driver and CUDA toolkit 11.4 (nothing later).
This becomes a further challenge because CUDA toolkit 11.4 is not supported or available on anything later than Ubuntu 20.04.
@langstonmeister commented on GitHub (Aug 10, 2024):
Yeah, it has to be 11.4. It will not compile with an older version of the CUDA toolkit. I was able to install the CUDA 11.4 and driver-470 on Ubuntu 24.04 and it's working so far. I needed to use the driver provided by Ubuntu and install just the CUDA stuff from NVIDIA.
My previous logs were from a system that I messed up pretty bad... ended up reinstalling the OS and starting from scratch with just the Ubuntu driver and CUDA 11.4 and it works.
@ZeroZen270 commented on GitHub (Aug 17, 2024):
So, to be clear, you got Ollama working after a rebuild, with a K80? If so, what steps would one take on Ubuntu Noble? Thanks in advance.
@langstonmeister commented on GitHub (Aug 18, 2024):
Technically I got it working on 2 k40c cards, but I would assume that it should also work for the k80.
I think I had to make one more edit, but I can't remember where it was now. Or else it was when I installed open-webui on top of this. I'll try to update my repos with all my stuff in the next few days, so hopefully you could just clone my repo and get it all done.
I hope that helps, it took me a while to get through it all, but I feel like it was worth it. This system is much faster than my other server with a GTX1660, and I can run some pretty huge models across the 2 cards with 24GB of VRAM in total.
@langstonmeister commented on GitHub (Aug 18, 2024):
I did have to make one more edit! In the gpu/gpu.go file, change the line about CUDA compute to:
so that it will not throw errors.
@orlyandico commented on GitHub (Aug 18, 2024):
To clarify: the above steps work on Ubuntu 24?
I had some serious problems even getting the 11.4 CUDA software installing
on 22 (seeing as it's supposed to be supported for 20.04)
On Sun, Aug 18, 2024 at 2:51 PM langstonmeister @.***>
wrote:
@langstonmeister commented on GitHub (Aug 18, 2024):
I did not have any issues with it. The only thing is making sure not to try to install the driver that comes packaged with the CUDA tools. Keep the Ubuntu-driver one, and just install the cuda version from that package.
@simsi-andy commented on GitHub (Sep 17, 2024):
Any advice for Windows users? (Besides changing to Ubuntu) 😉😬
@bones0 commented on GitHub (Dec 10, 2024):
Hi
There is a fork, https://github.com/austinksmith/ollama37, but it seems to be a bit behind the current Ollama version.
Personally I ran into too many problems with my K40/K80. Some projects flatly refuse the NVIDIA 470 driver as "too old", etc. I'd have to build a "legacy rig" just for the K series (and soon also the M series), which is not at the top of my list.
@dhiltgen commented on GitHub (Dec 10, 2024):
For instructions building from source for these older GPUs, see https://github.com/ollama/ollama/blob/main/docs/development.md#older-linux-cuda-nvidia
@ShadowGallery93 commented on GitHub (Jan 18, 2025):
This is awesome!! The backwards compatibility to compute 3.7 is clutch!
My tips:
@wonka929 commented on GitHub (Jan 23, 2025):
Here I am, with the same issue.
I'm on Manjaro with the latest stable release.
Nvidia K3100M in the device, with 4GB of RAM.
The last "kind of accepted" driver is 470, but I'm not totally sure. BTW, I have it installed and it's working.
The CUDA version compatible with 470 is 11.4, as seen from nvidia-smi.
Kernel 6.12.
Manjaro does not have CUDA 11 in its repos, so I downloaded the packages from https://archive.archlinux.org/packages/c/ (gcc10, cuda 11.4, cuda-tools 11.4).
I ran
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
in ~/Downloads, set
var (
CudaComputeMajorMin = "3"
CudaComputeMinorMin = "0"
)
and ran ./ollama serve,
but it's still not working.
Any idea? @ShadowGallery93
@quq233 commented on GitHub (Jan 24, 2025):
this worked for me: sudo make -j 8 CUDA_11_PATH=/usr/local/cuda-11.4 CUSTOM_CPU_FLAGS="" CUDA_ARCHITECTURES="35;37;50;52" EXTRA_GOLDFLAGS="-X=github.com/ollama/ollama/discover.CudaComputeMajorMin=3 -X=github.com/ollama/ollama/discover.CudaComputeMinorMin=5" PATH="/path/to/your/go/bin:$PATH"
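(The EXTRA_GOLDFLAGS entries rely on the Go linker's -X option, which overrides package-level string variables at link time, so no source edit is needed. Stripped of the Makefile wrapper, the equivalent plain go build form would look roughly like:
$ go build -ldflags "-X github.com/ollama/ollama/discover.CudaComputeMajorMin=3 -X github.com/ollama/ollama/discover.CudaComputeMinorMin=5" .
)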
@wonka929 commented on GitHub (Jan 24, 2025):
@quq233 still having issues, both with CC not set (so gcc-14) and with CC set to gcc-10:
$ make CC=gcc-10 CUDA_11_PATH=/opt/cuda CUSTOM_CPU_FLAGS="" CUDA_ARCHITECTURES="35;37;50;52" EXTRA_GOLDFLAGS="-X=github.com/ollama/ollama/discover.CudaComputeMajorMin=3 -X=github.com/ollama/ollama/discover.CudaComputeMinorMin=5" PATH="/usr/bin/go:$PATH"
make[1]: Nothing to be done for 'cpu'.
/opt/cuda/bin/nvcc -c -Xcompiler -fPIC -D_GNU_SOURCE -fPIC -Wno-unused-function -std=c++17 -Xcompiler "" -t2 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_USE_CUDA=1 -DGGML_SHARED=1 -DGGML_BACKEND_SHARED=1 -DGGML_BUILD=1 -DGGML_BACKEND_BUILD=1 -DGGML_USE_LLAMAFILE -DK_QUANTS_PER_ITERATION=2 -DNDEBUG -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Wno-deprecated-gpu-targets --forward-unknown-to-host-compiler -use_fast_math -I./llama/ -O3 --generate-code=arch=compute_35,code=[compute_35,sm_35] --generate-code=arch=compute_37,code=[compute_37,sm_37] --generate-code=arch=compute_50,code=[compute_50,sm_50] --generate-code=arch=compute_52,code=[compute_52,sm_52] -DGGML_CUDA_USE_GRAPHS=1 -o llama/build/linux-amd64/llama/ggml-cuda/ggml-cuda.cuda_v11.o llama/ggml-cuda/ggml-cuda.cu
gcc: internal compiler error: Segmentation fault signal terminated program cc1plus
Please submit a full bug report,
with preprocessed source if appropriate.
See https://bugs.archlinux.org/ for instructions.
make[1]: *** [make/gpu.make:53: llama/build/linux-amd64/llama/ggml-cuda/ggml-cuda.cuda_v11.o] Error 255
make: *** [Makefile:48: cuda_v11] Error 2
@wonka929 commented on GitHub (Jan 27, 2025):
@ShadowGallery93 I managed to do it.
Final command:
make CUDA_11_PATH=/opt/cuda CUDA_ARCHITECTURES="30;35;37;50;52" EXTRA_GOLDFLAGS="-X=github.com/ollama/ollama/discover.CudaComputeMajorMin=3 -X=github.com/ollama/ollama/discover.CudaComputeMinorMin=0" PATH="/usr/bin/go:$PATH"
but I had to do:
sudo ln -sf /usr/bin/gcc-10 /usr/bin/gcc
sudo ln -sf /usr/bin/g++-10 /usr/bin/g++
to link the proper compiler, because using the CC=gcc-10 and GCC=g++-10 flags wasn't working.
Well... now I have another problem: after ./ollama serve, if I try to communicate with ollama run, it gets into the chat correctly, but when I send a message it responds with:
Error: POST predict: Post "http://127.0.0.1:37661/completion": EOF
With this log from the server:
time=2025-01-27T14:11:16.404+01:00 level=DEBUG source=sched.go:407 msg="context for request finished"
time=2025-01-27T14:11:16.404+01:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/francesco/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff duration=5m0s
time=2025-01-27T14:11:16.404+01:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/home/francesco/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff refCount=0
time=2025-01-27T14:11:16.414+01:00 level=DEBUG source=server.go:416 msg="llama runner terminated" error="exit status 2"
No idea of the reason...
@joao-le commented on GitHub (Jan 27, 2025):
I have a K40c (compute capability 3.5, supported through the NVIDIA 470.xx legacy drivers) and it was very easy to install on Ubuntu 20.04 (it didn't work, on the contrary, on Fedora 41).
Procedure:
Install CUDA Toolkit 11.4 (basically follow the procedures on their page). This also installs the drivers.
$ sudo apt install g++
$ sudo reboot
Install golang (the latest version, 1.23.5, worked, but not the repository version).
clone ollama and follow the procedure on Ollama's Development page:
$ git clone https://github.com/ollama/ollama.git
$ cd ollama
$ make -j 5 CUDA_ARCHITECTURES="35;37;50;52" EXTRA_GOLDFLAGS="-X=github.com/ollama/ollama/discover.CudaComputeMajorMin=3 -X=github.com/ollama/ollama/discover.CudaComputeMinorMin=5"
Launch the Ollama server and the model you want to use. E.g.:
$ ./ollama serve &>/dev/null
$ ./ollama run phi4
I hope this helps.
@wonka929 commented on GitHub (Jan 27, 2025):
Yep, maybe I've figured it out.
My GPU is a K3100M.
It seems that it supports only 3.0 as its compute capability.
I managed to compile with 3.0 capability, but I always get the parsing error from the API with 3.0.
I don't know; maybe 3.0 is just far too old to work.
Could that be?
@sanchez314c commented on GitHub (Feb 17, 2025):
Can anyone clarify these steps for someone who is not a shell guru?
I'm working with a fresh install of 24.04. Blank slate. Ubuntu picks up the K80 automatically and loads the driver.
I'm not contesting that the above works, but there are clearly some missing steps or things to check for the laymen out there like me. If anyone can provide some clarity, it would be greatly appreciated.
@langstonmeister commented on GitHub (Feb 17, 2025):
Do you have the CUDA drivers and toolkit installed? You will need CUDA tools 11.4, but make sure not to use the included driver - stick with the version that Ubuntu loaded for you.
Something that helped me a lot when I was getting started with the CLI is setting up the SSH from another computer. Then I can just copy and paste from my regular desktop. I'm running a headless server though, not sure exactly what your setup is.
The included documentation should be pretty copy-paste friendly.
Development documentation for Linux
Check out my comments on this thread, which should also be copy-paste friendly:
here - pt1 and
here - pt2
Hope that helps!
@Diego77648 commented on GitHub (Feb 22, 2025):
Just to update a bit in case anyone is trying to use an older GPU: the gpu.go file is in /discover/gpu.go, and you just need to change
var (
CudaComputeMajorMin = "5"
CudaComputeMinorMin = "0"
)
to
var (
CudaComputeMajorMin = "3"
CudaComputeMinorMin = "0"
)
and then compile Ollama; it will then detect your GPU:
time=2025-02-22T10:41:43.208Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-02-22T10:41:48.420Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4db0c484-9ecf-65ef-9cc9-94696b5b1ddc library=cuda variant=v11 compute=3.7 driver=11.4 name="Tesla K80" total="11.2 GiB" available="11.1 GiB"
time=2025-02-22T10:41:48.420Z level=INFO source=types.go:130 msg="inference compute" id=GPU-6ac060d1-f8a6-124e-0541-e15257875097 library=cuda variant=v11 compute=3.7 driver=11.4 name="Tesla K80" total="11.2 GiB" available="11.1 GiB"
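(After making that edit, the source build itself follows the same steps detailed elsewhere in this thread; roughly:
$ cmake -B build
$ cmake --build build
$ go build .
)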
@wonka929 commented on GitHub (Feb 22, 2025):
Yes, true, but beware!
With compute capability 3.0 it won't compile.
The real minimum requirement is 3.5:
var (
CudaComputeMajorMin = "3"
CudaComputeMinorMin = "5"
)
@idream3000 commented on GitHub (Mar 3, 2025):
nvidia-smi : No running processes found
ollama ps : 100% GPU
but the truth is 100% CPU usage!
@idream3000 commented on GitHub (Mar 5, 2025):
Finally done!
Shared at:
https://github.com/idream3000/ollama37.git
@webclinic017 commented on GitHub (Mar 7, 2025):
@idream3000 is there a Windows build for the same?
@dogkeeper886 commented on GitHub (Apr 3, 2025):
You're a legend! Your Git and hints are seriously helping me out. GCC-10 is a lifesaver.
@aecium commented on GitHub (Apr 9, 2025):
I have gotten @idream3000's custom repo to build and to use my K80 by following the steps below. I have built this a few times to test, but if you run into problems let me know.
@idream3000 is an absolute legend, for the record.
I did a fresh install of Ubuntu 22.04 (I unplugged my K80 during the install so it would not install Nvidia drivers).
Once the install was complete, I updated:
$ sudo apt update
$ sudo apt upgrade
Grab gcc-10, which will be needed later on for compiling ollama37:
$ sudo apt install gcc-10
remove default gcc
$ sudo rm /usr/bin/gcc
set default gcc to gcc-10 by creating a symlink
$ sudo ln -s /usr/bin/gcc-10 /usr/bin/gcc
$ sudo apt install g++-10
set g++ to g++-10
$ sudo rm /usr/bin/g++
$ sudo ln -s /usr/bin/g++-10 /usr/bin/g++
$ sudo apt install cmake
$ sudo snap install go --classic
You want go version go1.24.2 linux/amd64. You will need to reboot before you can run go version.
Remove all Nvidia drivers if there are any. I unplugged my K80 during the Ubuntu install, so there were none to remove.
Download cuda_11.4.0_470.42.01_linux.run using the wget command below, or from https://developer.nvidia.com/cuda-11-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local
$ wget https://developer.download.nvidia.com/compute/cuda/11.4.0/local_installers/cuda_11.4.0_470.42.01_linux.run
$ sudo sh cuda_11.4.0_470.42.01_linux.run
If the installer warns you about Nvidia drivers already being installed, cancel and remove/purge all Nvidia drivers.
When the installer asks what you want to install, deselect everything but the CUDA toolkit.
Once the CUDA toolkit is installed, install the Nvidia 470 server driver:
$ sudo apt install nvidia-driver-470-server
reboot
After the reboot, run nvidia-smi to check that it sees your K80:
$ nvidia-smi
Mon Mar 17 15:52:16 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02 Driver Version: 470.256.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:03:00.0 Off | 0 |
| N/A 27C P8 27W / 149W | 4MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:04:00.0 Off | 0 |
| N/A 32C P8 26W / 149W | 4MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 936 G /usr/lib/xorg/Xorg 3MiB |
| 1 N/A N/A 936 G /usr/lib/xorg/Xorg 3MiB |
+-----------------------------------------------------------------------------+
Add the CUDA 11.4 nvcc temporarily to your PATH variable; this will need to be re-run if you reboot or open a new terminal:
$ export PATH=${PATH}:/usr/local/cuda-11.4/bin/
$ git clone https://github.com/idream3000/ollama37.git
$ cd ollama37
$ cmake -B build
You want to see "-- Looking for a CUDA compiler - /usr/local/cuda-11.4/bin/nvcc" towards the end of the output. If you don't, make sure /usr/local/cuda-11.4/bin/nvcc exists and /usr/local/cuda-11.4/bin/ is in your PATH.
$ cmake --build build
When that is done, run:
$ go run . serve
In another terminal, run:
$ go run . run llama3
If you want to monitor the K80 GPUs, you can run, in a third terminal:
$ watch nvidia-smi
Example output; you can see that GPU 1 is at 93% while llama3 is answering my question "Tell me about the moon":
Tue Apr 8 17:51:25 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02 Driver Version: 470.256.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:03:00.0 Off | 0 |
| N/A 45C P8 27W / 149W | 3485MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:04:00.0 Off | 0 |
| N/A 45C P0 148W / 149W | 6114MiB / 11441MiB | 93% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 949 G /usr/lib/xorg/Xorg 3MiB |
| 0 N/A N/A 201371 C ...092e7b06ed6b05be-d/ollama 3477MiB |
| 1 N/A N/A 949 G /usr/lib/xorg/Xorg 3MiB |
| 1 N/A N/A 204890 C ...092e7b06ed6b05be-d/ollama 6107MiB |
+-----------------------------------------------------------------------------+
@MORITZ0405 commented on GitHub (Apr 17, 2025):
Is there a build for Windows?
@aecium commented on GitHub (Apr 17, 2025):
@MORITZ0405 in theory, if the toolchain needed to build it (gcc-10, g++-10, CMake, and such) has versions that run on Windows, you should be able to build this in the same manner on Windows.
However, you might find it easier to follow the steps laid out above by setting up a local VM running Ubuntu 22.04, using VirtualBox or Hyper-V, on the Windows system that has the K80 in it.
Here is a link to documentation for setting up Ubuntu on VirtualBox 7: https://ubuntu.com/tutorials/how-to-run-ubuntu-desktop-on-a-virtual-machine-using-virtualbox#1-overview
And here is one for the same on Hyper-V: https://documentation.ubuntu.com/server/how-to/virtualisation/ubuntu-on-hyper-v/index.html
I just did a quick search for those and did not look through them, but they should be a good place to start.
Hope that helps
@dogkeeper886 commented on GitHub (Apr 18, 2025):
If you don't mind using a Docker image: I personally use Linux and don't have a Windows environment.
https://hub.docker.com/r/dogkeeper886/ollama37
@hdnh2006 commented on GitHub (Apr 18, 2025):
thank you mate! you are the best!!
@LeGuipo commented on GitHub (Apr 22, 2025):
So I just did Diego77648’s trick and it works. My GTX 780 SC is definitely NOT TOO OLD ANYMORE 🥳
Though I cannot run models larger than 3B with it, text generation time is more than halved.
I'm happy not to have to waste money on another GPU just for toying with Ollama.
@KeemOnGithub commented on GitHub (Aug 19, 2025):
I tried building this on Windows using go build ., but the generated .exe file won't open. Does anyone know why?
@aecium commented on GitHub (Aug 19, 2025):
Does it run and work if you do something like this? From the root of the cloned GitHub repo, run 'go run . serve'.
@idream3000 commented on GitHub (Aug 19, 2025):
Please try WSL on Windows!
@KeemOnGithub commented on GitHub (Aug 20, 2025):
It compiles, thank you! I can't get it to recognize my GPUs though... I think the latest drivers for the Tesla K20c don't support WSL2.
@KeemOnGithub commented on GitHub (Aug 20, 2025):
Ah yes, go run . serve does work. An .exe would be more convenient for me, but this is fine for testing.
Thanks!
@idream3000 commented on GitHub (Aug 20, 2025):
My mod only works for the K80 (compute capability 3.7); you can modify the code for the K20c yourself by simply adding "3.5;" before "3.7;" in the modified file.