Mirror of https://github.com/ollama/ollama.git (synced 2026-05-07 00:22:43 -05:00)
Closed · opened 2026-04-22 11:03:29 -05:00 by GiteaMirror · 86 comments
Originally created by @robbyjo on GitHub (Dec 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8188
What is the issue?
I have 2x4090 and 192GB RAM on my Windows 11 machine. I am currently using Ollama 0.5.4. I am using the following model with 32K context:
hf.co/mradermacher/Llama-3.3-70B-Instruct-abliterated-i1-GGUF:Q5_K_M
Theoretically, I could fit most of the layers into my 2x24GB of VRAM. The VRAM usage is currently about 1.5GB on my machine (via nvidia-smi), and that's only on one card. The other one is empty. I have already set CUDA_VISIBLE_DEVICES to 0,1.
However, if I use both GPUs, the output is garbage (random words, punctuation, and sometimes foreign words). If I limit myself to one GPU (say, by offloading only 28 layers to the GPU), then the output is fine, albeit a bit slow. I tried many things, such as enabling or disabling flash attention and changing the KV cache type, but nothing seems to help. If I enable OLLAMA_SCHED_SPREAD, then the output will be garbled no matter what (regardless of how many layers I offload to the GPU).
Example output: ":[-":[- Doug":[-keypress Ment":[-":[- spline":[-":[-":[- klu락огра":[-":[-":[-":[-":[-":[-":[-":[-":[-":[-isser":[-<Path popularity":[- menstratori#ab направ slate":[-(indices":[-uerdo.serialلقwnerkie":[-
I read about setting OLLAMA_GPU_OVERHEAD to avoid corruption like this, but the output is still garbled.
In retrospect, my issue is like PR #7575, except that I am using 2x4090 and in Windows 11. I would love any pointers. I honestly think this is a bug.
Thank you so much.
Here is server.log:
time=2024-12-20T14:17:01.858-05:00 level=INFO source=images.go:757 msg="total blobs: 74"
time=2024-12-20T14:17:01.859-05:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2024-12-20T14:17:01.860-05:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2024-12-20T14:17:01.861-05:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[rocm_avx cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx]"
time=2024-12-20T14:17:01.861-05:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2024-12-20T14:17:01.861-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-12-20T14:17:01.861-05:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2024-12-20T14:17:01.861-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=24 efficiency=16 threads=32
time=2024-12-20T14:17:02.092-05:00 level=INFO source=gpu.go:334 msg="detected OS VRAM overhead" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.1 GiB"
time=2024-12-20T14:17:02.094-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-a8d5d6f9-5cd0-cdb8-2cf7-9e58f4786c23 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
time=2024-12-20T14:17:02.094-05:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-fdd12bb6-e728-5d85-46aa-70331addbfb8 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
[GIN] 2024/12/20 - 14:17:02 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/20 - 14:17:02 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2024/12/20 - 14:17:36 | 200 | 2.0599ms | 127.0.0.1 | GET "/api/tags"
time=2024-12-20T14:17:36.410-05:00 level=WARN source=types.go:509 msg="invalid option provided" option=stream_response
time=2024-12-20T14:17:36.646-05:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 library=cuda parallel=1 required="28.3 GiB"
time=2024-12-20T14:17:36.676-05:00 level=INFO source=server.go:104 msg="system memory" total="191.7 GiB" free="159.5 GiB" free_swap="242.5 GiB"
time=2024-12-20T14:17:36.708-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=28 layers.model=81 layers.offload=28 layers.split=14,14 memory.available="[22.5 GiB 22.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="61.6 GiB" memory.required.partial="28.3 GiB" memory.required.kv="5.0 GiB" memory.required.allocations="[14.1 GiB 14.1 GiB]" memory.weights.total="50.0 GiB" memory.weights.repeating="49.2 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="4.3 GiB" memory.graph.partial="4.3 GiB"
time=2024-12-20T14:17:36.708-05:00 level=INFO source=server.go:223 msg="enabling flash attention"
time=2024-12-20T14:17:36.712-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 --ctx-size 32768 --batch-size 512 --n-gpu-layers 28 --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 14,14 --port 63440"
time=2024-12-20T14:17:36.715-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-20T14:17:36.716-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2024-12-20T14:17:36.716-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2024-12-20T14:17:36.789-05:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2024-12-20T14:17:36.872-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2024-12-20T14:17:36.873-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63440"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 4090) - 22994 MiB free
time=2024-12-20T14:17:36.966-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 47 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Llama-3.3
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: general.file_type u32 = 17
llama_model_loader: - kv 24: llama.vocab_size u32 = 128256
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.url str = https://huggingface.co/mradermacher/L...
llama_model_loader: - kv 37: mradermacher.quantize_version str = 2
llama_model_loader: - kv 38: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 39: mradermacher.quantized_at str = 2024-12-13T11:20:37+01:00
llama_model_loader: - kv 40: mradermacher.quantized_on str = rich1
llama_model_loader: - kv 41: general.source.url str = https://huggingface.co/huihui-ai/Llam...
llama_model_loader: - kv 42: mradermacher.convert_type str = hf
llama_model_loader: - kv 43: quantize.imatrix.file str = Llama-3.3-70B-Instruct-abliterated-i1...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 314
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
[GIN] 2024/12/20 - 14:17:38 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/12/20 - 14:17:38 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloaded 28/81 layers to GPU
llm_load_tensors: CPU model buffer size = 688.88 MiB
llm_load_tensors: CUDA_Host model buffer size = 30736.73 MiB
llm_load_tensors: CUDA0 model buffer size = 7978.13 MiB
llm_load_tensors: CUDA1 model buffer size = 8224.63 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_ctx_per_seq = 32768
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 3536.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 952.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 952.00 MiB
llama_new_context_with_model: KV self size = 5440.00 MiB, K (q8_0): 2720.00 MiB, V (q8_0): 2720.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.52 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 176.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 80.01 MiB
llama_new_context_with_model: graph nodes = 2247
llama_new_context_with_model: graph splits = 578 (with bs=512), 4 (with bs=1)
time=2024-12-20T14:17:52.507-05:00 level=INFO source=server.go:594 msg="llama runner started in 15.79 seconds"
llama_model_loader: loaded meta data with 47 key-value pairs and 724 tensors from E:\DeepLearning\LLM\blobs\sha256-4dfc7c22ba83a22c83ad3dc0cf280792bb794cbf139a080f3df70e2b13c94090 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Llama-3.3
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 12: general.languages arr[str,8] = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: general.file_type u32 = 17
llama_model_loader: - kv 24: llama.vocab_size u32 = 128256
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.url str = https://huggingface.co/mradermacher/L...
llama_model_loader: - kv 37: mradermacher.quantize_version str = 2
llama_model_loader: - kv 38: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 39: mradermacher.quantized_at str = 2024-12-13T11:20:37+01:00
llama_model_loader: - kv 40: mradermacher.quantized_on str = rich1
llama_model_loader: - kv 41: general.source.url str = https://huggingface.co/huihui-ai/Llam...
llama_model_loader: - kv 42: mradermacher.convert_type str = hf
llama_model_loader: - kv 43: quantize.imatrix.file str = Llama-3.3-70B-Instruct-abliterated-i1...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 314
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
[GIN] 2024/12/20 - 14:18:16 | 200 | 39.8873643s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/12/20 - 14:18:16 | 200 | 6.2904ms | 127.0.0.1 | GET "/api/tags"
OS: Windows
GPU: Nvidia
CPU: Intel
Ollama version: 0.5.4
@rick-github commented on GitHub (Dec 20, 2024):
Garbled output sometimes means that the context window was exceeded. What size of request are you sending? If you set OLLAMA_DEBUG=1 in the server environment, the logs will contain more information that may be useful.
@robbyjo commented on GitHub (Dec 20, 2024):
I am not sure what the size of the request is. It is not big (1772 characters). Here is the server.log with OLLAMA_DEBUG on:
@YonTracks commented on GitHub (Dec 21, 2024):
cheers, that seems correct, needs more num_ctx:
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
@robbyjo commented on GitHub (Dec 21, 2024):
I did set the num_ctx to 32K before the query. I used Open WebUI for it, Now I used the command line:
Here is the answer to my query:
It appears that Ollama first runs with the default num_ctx (2048) and then sets it to 32768 after my /set parameter num_ctx command. And as you can see above in my previous log, there is --ctx-size 32768. I tried this command-line switch to ollama, but it does not seem to work. So something must be amiss. Please do not dismiss this bug as a rookie mistake. Here is the server.log in debug mode:
@YonTracks commented on GitHub (Dec 21, 2024):
actually srry, I just spotted it and remembered. I think this is same issue as #7984
hopefully someone will know.
@robbyjo commented on GitHub (Dec 21, 2024):
It may not be the same, since I set the context size properly. Going to a 128K context doesn't make any difference. I have a beefy PC (192GB RAM and 2x4090 with 24GB VRAM each). If I only fill one graphics card (or simply set CUDA_VISIBLE_DEVICES=1 instead of 0,1), then the whole thing works, including 128K context. However, my desire is to use BOTH of my GPUs, not just one. And this is on Windows 11, by the way. I heard that this was not a problem on Linux.
@YonTracks commented on GitHub (Dec 21, 2024):
ahh yes good info cheers.
yes, I see the 2 changes with both gpus.
llama_new_context_with_model: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be...
and
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be...
why?
someone will know, I will keep investigating anyway If I find anything I will share the info.
good luck.
@YonTracks commented on GitHub (Dec 21, 2024):
you say only on windows? good info cheers.
Will it be here, in llm/server.go, in the params for the other model? Something is happening, but it's late, so that's all I've got for now.
Actually, I think we nailed it? How are you setting the num_ctx? I'm not sure that will persist to both models.
Try options, and other ways, maybe Open WebUI, but good progress.
cheers
good luck.
@robbyjo commented on GitHub (Dec 21, 2024):
Thanks. Not 100% sure about Windows thing. That was only my impression.
Not sure what you mean by this statement. This issue happens to all models. But I tried this also with ollama command line and the same thing happened. Are you saying that the num_ctx was set on one GPU, but not the others? That'd be strange.
I am not sure about how to debug Go language. This would be my first exposure. I could handle some other languages (like C/C++, Python, Java or R)
@YonTracks commented on GitHub (Dec 21, 2024):
Are you saying that the num_ctx was set on one GPU, but not the others? That'd be strange.
yep, that's what seems to be happening this needs to be checked.
@robbyjo commented on GitHub (Dec 21, 2024):
Ok, I'd be happy to test a test build if a precompiled binary for Windows is available.
@rick-github commented on GitHub (Dec 22, 2024):
num_ctx is a per-model setting, not per-GPU. Have you tested with older versions of ollama? For example, 0.3.14 uses C++ runners rather than Go.
@YonTracks commented on GitHub (Dec 22, 2024):
I bet it worked, then for that reason, "uses C++ runners rather than go".
I'm seeing an issue with the env params being passed to the other gpus, the other gpus are using the default? not the set params (and a windows thing for sure lol), I can't test multi gpu, if you can? then check that? try hard code the gpu list and params or something.
llm/server.go: line 173:
this should hard code the num_ctx.
for me:
with this change, build / make the dev mode, and go build . etc...
then for quick testing I copy the new ollama.exe and paste in the "\AppData\Local\Programs\Ollama\ollama.exe"
to keep the original I change the ollama.exe to .txt so I can change it back lol.
good luck
hope thats ok, better than compiling a OllamaSetup.exe? I could do that but seems scary, not safe practice (only if from ollama, I am not ollama lol).
but: 0.5.4-yontracks
for the hardcoded "--ctx-size", strconv.Itoa(32768) and the correct OllamaSetup.exe wizard size. I'll share the link when the build completes.
Actually, I can't do that; the file size is too big, I tried.
better to build and test anyway.
good luck
@robbyjo commented on GitHub (Dec 22, 2024):
Thank you for the insight. I tried version 0.3.14 and set the context (num_ctx) to 131072 and IT WORKED!!!!! THANK YOU SO MUCH!!!
I tested the following model:
Server log:
@rick-github commented on GitHub (Dec 22, 2024):
@robbyjo Great! Would it be possible for you to try different versions until you find the one where it fails? 0.4.0 is the likely culprit because of the switch to Go runners, but there has been other work between there and 0.5.4 that might also have caused the problem. If it can be nailed down to a specific version, there's a better chance of finding and fixing the root cause.
@robbyjo commented on GitHub (Dec 22, 2024):
@rick-github It worked last at 0.4.7, failed at 0.5.0.
However, for the entire 0.4.x series, I saw GPU VRAM usage of only ~6GB for GPU 0 and ~5GB for GPU 1, instead of ~22GB each on 0.3.14. And it feels slow as well compared to a single GPU on 0.5.4.
Edit: Clarification. For 0.4.x, GPU VRAM usage was ~22GB BEFORE I changed the num_ctx parameter and put in my query.
@robbyjo commented on GitHub (Dec 22, 2024):
For version 0.4.7, I tried the /set parameter num_gpu 48 (and 54, 60, 64), it failed with cudaMalloc failed: out of memory if num_ctx is 131072
Strangely enough, for version 0.4.7, setting num_gpu to 54 and num_ctx to 32768, the output is garbled again even though the VRAM usage is up.
Setting num_gpu to 40 and num_ctx to 32768, the GPU usage was up to about ~12GB each card, but the output is also garbled.
@rick-github commented on GitHub (Dec 22, 2024):
Could you add the log for 0.4.7?
@robbyjo commented on GitHub (Dec 22, 2024):
Ok. Here is the log for 0.4.7. At first, I only changed num_ctx to 131072, which worked great except for low memory utilization. I interrupted the output. Then I changed num_ctx to 32768 and num_gpu to 48 and repeated the same query. The result was then garbled.
@YonTracks commented on GitHub (Dec 23, 2024):
great progress cheers.
what have we learned / confirmed, should make a list.
I think but want to be sure:
When using multiple GPUs, the params are only being set on the main GPU, and the others seem to switch to the default (0.4.7), kv context and grammar updates.
edit^ According to the logs, it seems 0.4.7 is also switching back to the default ctx on the second GPU.
Check 2cd11ae365: the default num_ctx can be too small in some cases. Was this affected in earlier versions? And I wonder whether a slightly larger default would help or hinder overall.
Windows and/or Linux, etc.?
@robbyjo commented on GitHub (Dec 23, 2024):
@YonTracks Thanks.
Not sure about your first question, but for your second question usually for Llama 3.1, 32K context should be enough. Also, for your third question, I can only confirm Windows. Cannot confirm Linux since I only have Win 11 machines.
For 0.4.x, I noticed that the GPU VRAM utilization was quite low (~6GB for GPU 0 and ~5GB for GPU 1) and it was a lot slower than single GPU (GPU 1) for v0.5.4.
@YonTracks commented on GitHub (Dec 23, 2024):
need to confirm the params are not being used on the other gpus and why?
@YonTracks commented on GitHub (Dec 23, 2024):
hardcoded-ctx.txt
I think with a hardcoded num_ctx we can confirm this? If the num_ctx still doesn't get used on the other GPUs... there's your problem.
@robbyjo commented on GitHub (Dec 23, 2024):
I'd be more than happy to test this, but how?
@YonTracks commented on GitHub (Dec 23, 2024):
need a local dev build and multi-GPU's?
I think the pro's have enough info?
super cheers, I will leave it to those fine folks.
unless your keen and have a dev build.
@robbyjo commented on GitHub (Dec 23, 2024):
I currently do not have a dev build. I have MSVC 2022, MSYS2, and Go 1.23 installed. But unsure how to proceed.
@YonTracks commented on GitHub (Dec 23, 2024):
I bet Jesse will know; should we ping? It is the holidays. Rick will know?
Yep, the Windows dev build was tricky for me, and even when I thought I had it compiled and all correct, ollama would still work but with issues.
It wasn't until I could 'make' via the .iss script and build a setup exe that I could confirm the dev build was correct, so safest to just not?
But if you're keen, it's good for learning even, like me (as long as we don't hinder).
https://github.com/ollama/ollama/blob/main/docs/development.md
good luck super cheers for your help
@rick-github commented on GitHub (Dec 23, 2024):
There is no per-GPU context window. The context window is set for the runner via opts.NumCtx, and hardcoding it in server.go is exactly the same as "options":{"num_ctx":32768}. The change in ctx-size in the logs is due to the model being loaded with the default context size for one API call and then being loaded with a different context size when the API specifies num_ctx. The KV buffer allocated on each GPU will vary by the number of layers assigned to the GPU, but the total of all KV buffers will always sum to the proportional value of ctx-size given to the runner. Reloading the model when the context size changes is a problem in some circumstances that could be alleviated with https://github.com/ollama/ollama/pull/8029.
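For reference, a minimal sketch of passing num_ctx with the request itself, assuming a local server on the default 127.0.0.1:11434 and using the model and prompt from this thread (the requests dependency and the exact payload shape are illustrative, not from the thread):

import requests

# Ask the server to load the model with a 32K context for this request;
# equivalent to the "options":{"num_ctx":32768} JSON mentioned above.
payload = {
    "model": "hf.co/mradermacher/Llama-3.3-70B-Instruct-abliterated-i1-GGUF:Q5_K_M",
    "messages": [{"role": "user", "content": "Write the game of Tetris in Python"}],
    "options": {"num_ctx": 32768},
    "stream": False,
}

resp = requests.post("http://127.0.0.1:11434/api/chat", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["message"]["content"])

A client or UI that does not forward num_ctx would cause the model to be reloaded with the 2048 default, which would match the n_ctx = 2048 lines seen in the failing logs.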
@YonTracks commented on GitHub (Dec 23, 2024):
Yes, perfect, thank you! That's what I needed to know.
I will investigate.
Exactly why (the reload bit), with multiple GPUs, does the num_ctx start with the set num_ctx param correctly but then switch back to the default? And when this happens, the multi-GPU output gets corrupted, the same as one GPU with a low ctx for these particular 70B models.
And is it only Windows?
I see in the logs that it says llama_new_context_with_model: n_ctx = 2048 when the issue is happening, and 2048 is the default, even with multiple GPUs, so we can confirm the issue is in how that all happens.
Expected behavior: the multi-GPU logs show the same num_ctx as the main one, either by using the same number or by splitting it in half or whatever, but the multiple GPUs are happy.
@YonTracks commented on GitHub (Dec 23, 2024):
"Awesome! I think that's it, or very closely related. Thank you so much for your hard work and dedication!"
@robbyjo commented on GitHub (Jan 3, 2025):
FYI, Kobold 1.80.3 seems to have the same issue. So perhaps this is a llama.cpp issue.
@rick-github commented on GitHub (Jan 3, 2025):
It wasn't the switch to Go runners, and the only big change from 0.4.7 to 0.5.0 was the K/V quantization, which makes use of parts of llama.cpp that weren't used before. It would be helpful to retry the tests with OLLAMA_FLASH_ATTENTION=0. The initial post said that FA on/off had been tried, but all of the logs have it on.
@robbyjo commented on GitHub (Jan 3, 2025):
I turned off OLLAMA_FLASH_ATTENTION. Its output is garbled. Here is a sample of it:
Here is the debug output of server.log:
@robbyjo commented on GitHub (Jan 7, 2025):
Could it be that the update from v0.4.7 to 0.5.0 involves overwriting parts that were essential to multi-GPUs for Windows?
@rick-github commented on GitHub (Jan 8, 2025):
It's looking more like a Windows-specific issue. I configured a Linux server with 2x4090 and CUDA 12.7, and it works fine with the "Write the game of Tetris in Python" prompt and the writing prompt from the first log. Something you could try is installing the WSL CUDA driver and running ollama inside a WSL container. That would remove any Windows differences from ollama, and it would be down to just the Nvidia driver and the cards.
@YonTracks commented on GitHub (Jan 9, 2025):
This is very good info, cheers:
"I turned off OLLAMA_FLASH_ATTENTION. Its output is garbled."
and "windows specific" and OLLAMA_NUM_PARALLEL and "If I enable OLLAMA_SCHED_SPREAD, then the output will be garbled no matter what (regardless of how many layers I offload to GPU)."
For this particular issue it's num_ctx, but the embeddings issue is also very related! This will take me a while, a few things are happening here, but it all points back to the "Normalize the NumCtx for parallelism" change, server/sched.go with needsReload(ctx context.Context, req *LlmRequest), and Windows, and more.
Using go test I will find it.
edit^: actually the issue is showing via the go test: go test -tags=integration ./... Awesome, I should be able to sort it. Wish I could explain better, forgive me. I will show via the code itself.
@YonTracks commented on GitHub (Jan 9, 2025):
Howdy @rick-github,
should I just attempt to fix this, which is a few issues all related? but then what? do I try PR with full fix?
or
PR with each fix,
or
try to reveal each issue 1 at a time, either here or other related,
or a new bug report,
or
I should just wait, as the mob already are on to it?
I fear this issue/s is making ollama look bad?
lol hope this makes sense, a little at least.
@rick-github commented on GitHub (Jan 9, 2025):
If you think you understand the problem, go ahead and make a PR with a full fix with test cases to demonstrate the problem and resolution. If it turns out to be too big, the reviewers will likely provide guidance on how to split it into manageable chunks.
@robbyjo commented on GitHub (Jan 10, 2025):
@rick-github I tried running ollama on WSL2. I already installed the NVidia drivers per instruction, but I got "Error: timed out waiting for llama runner to start - progress 0.00 -"
@rick-github commented on GitHub (Jan 10, 2025):
Since ollama needs to communicate with the GPU via the WSL/host interface it's a bit slower, so the model load timed out before it finished. You can extend the timeout by setting OLLAMA_LOAD_TIMEOUT=30m in the environment of the ollama server inside the WSL container.
@robbyjo commented on GitHub (Jan 10, 2025):
@rick-github Thanks for the guidance. I ran ollama from WSL and I still got the garbled result. The query was the same: "Write the game of Tetris in Python"
Sample output:
@rick-github commented on GitHub (Jan 10, 2025):
OK, so that didn't eliminate any variables. What we know:
I see a docker path in the logs, are you running ollama in docker on Windows, or bare metal?
From the logs, it looks like you've tried the following models:
These models seem finetuned, have you tried a stock model from the ollama library, eg llama3.1:70b-instruct-q4_K_M?
@robbyjo commented on GitHub (Jan 10, 2025):
@rick-github I tried the stock model you indicated (llama3.1:70b-instruct-q4_K_M) and it still didn't work. I did it both in plain Windows and in WSL.
By the way, I mostly used 0.5.4, not 0.5.0.
@JohnSmithToYou commented on GitHub (Jan 11, 2025):
I have the same problem with my 2x4090 under WSL2. It looks like the new cache feature broke dual graphics cards. I don't get garbage, but my graphics cards are only half loaded. I just loaded Qwen2.5-Coder-32B-Instruct-Q6_K with a context of 98304. My cards should be filled! Instead they are only half full and it's offloading to the CPU.
@YonTracks In the log from above... Do these numbers look correct? Is minimum_memory correct? Isn't that the sum of both graphics cards? This is what my log shows also.
@rick-github commented on GitHub (Jan 11, 2025):
Full log.
@YonTracks commented on GitHub (Jan 11, 2025):
windows and goroutines, I'm seeing fun lol.
found a few things, trying fixes, but still more I think lol.
I am trying :)
https://github.com/ollama/ollama/pull/8029
edit^ I see a few updates already that should help this, in the code already, based on OLLAMA_FLASH_ATTENTION also.
ggml and cuda issues.
edit^
yep, last bit, yeeew :) runner.go
build and test. lets go. please lol.
cheers
@rick-github commented on GitHub (Jan 11, 2025):
Going back through the thread I realized I missed something.
So 0.4.7+ctx=128k+offload=4 works, 0.4.7+ctx=32k+offload=48 fails? 0.4.7 is not immune?
@YonTracks commented on GitHub (Jan 11, 2025):
my mind is mush lol, I'm having trouble getting my external gpu working for multi-gpu testing.
it was working but pci issues and riser, bugger.
heres the repo so far.
https://github.com/YonTracks/ollama-yontracks/tree/ollama-sched-server
maybe, rick or some multi-GPU user should test.
cheers, good luck
@YonTracks commented on GitHub (Jan 12, 2025):
far out motherboard issues, maybe I did it, I could only simulate but no more fails, and fast as, when I confirm I will PR.
@robbyjo commented on GitHub (Jan 12, 2025):
@YonTracks Just wondering if you could make a test build so that I could test it for you? Thanks!
@YonTracks commented on GitHub (Jan 12, 2025):
@robbyjo I'm not sure how, I will try learning to do that, fast, but Wednesday I should be able to test proper anyway. cheers, I'll try soon.
@rick-github commented on GitHub (Jan 12, 2025):
In the meantime, is it correct that 0.4.7 is not immune?
@YonTracks commented on GitHub (Jan 12, 2025):
Yes, I'm pretty sure 0.4.7 and earlier are not immune, but slight changes make it work sometimes; I can't confirm for multi-GPU.
Pretty sure the 3.2 vision release, 0.4.0, is the start of these issues, and it got better and better with each update, but was never fully immune.
@YonTracks commented on GitHub (Jan 12, 2025):
yep, wow, far out I managed to do it, epic! cheers.
https://github.com/YonTracks/ollama-yontracks/releases/tag/0.5.4-yontracks
@robbyjo commented on GitHub (Jan 13, 2025):
@YonTracks Thanks a lot for the custom build. Really appreciate it.
For "Write a game of Tetris in Python"
The output is still garbled, though: (>G>4#F"C+923)H<&6C!3:B"97
Also note that if I set CUDA_VISIBLE_DEVICE=1,0, it simply ignored GPU 0 and only loaded the model to GPU 1.
@YonTracks commented on GitHub (Jan 13, 2025):
@robbyjo cheers, can you experiment with default settings only, so no changes to visible devices, keep alive, parallel etc. and then, experiment with OLLAMA_FLASH_ATTENTION, OLLAMA_SCHED_SPREAD. checking the logs for the details, cheers.
good luck.
@robbyjo commented on GitHub (Jan 13, 2025):
@YonTracks The failure I posted earlier was with both OLLAMA_FLASH_ATTENTION and OLLAMA_SCHED_SPREAD equal false.
Any possible combination of the two flags (true/true, true/false, false/true) still yielded garbled output.
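As an aside, a small hypothetical helper (not from this thread) can make the "garbled or not" check less subjective: after restarting ollama with each OLLAMA_FLASH_ATTENTION / OLLAMA_SCHED_SPREAD combination, send the same prompt and report how much of the reply is ordinary text. The model name, endpoint, and heuristic are illustrative assumptions.

import string
import requests

MODEL = "llama3.1:70b-instruct-q4_K_M"  # or the hf.co model from this issue
PROMPT = "Write the game of Tetris in Python"

def ascii_fraction(text):
    # Fraction of the reply made up of plain letters, digits, and whitespace;
    # garbled replies like the ones pasted above score very low.
    if not text:
        return 0.0
    ok = sum(c in string.ascii_letters + string.digits + string.whitespace for c in text)
    return ok / len(text)

# Restart ollama with the env combination under test, then run this.
resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={"model": MODEL, "prompt": PROMPT, "stream": False},
    timeout=600,
)
reply = resp.json()["response"]
print(f"plain-text fraction: {ascii_fraction(reply):.2f}")
print(reply[:200])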
@YonTracks commented on GitHub (Jan 13, 2025):
@robbyjo awesome, same same, cheers for your help, very appreciative.
edit:^ I forgot to mention, just in case (you might not know or might forget): when making changes to the env, you need to restart ollama and the client.
edit:^ Sorry about the many edits. Also, just in case, at the top of the server.log you can check the env details to be sure.
Can or did you try num_ctx? If you don't want to try much, try near the max num_ctx (131072) and lower from there, or vice versa if there's no trouble. If any trouble, don't worry.
ollama will sort it out soon anyway, most likely, if not already.
I will keep trying anyway when I can.
cheers
@robbyjo commented on GitHub (Jan 13, 2025):
@YonTracks Yes, I did restart ollama. No worries. I tried the experiment with num_ctx set to 16384.
BUT!!!! I got a new finding. When I load the model on GPU 1 only, it was all good. HOWEVER, if I load the model only on GPU 0, then the output is also garbled!!! Did that tell you something?
@YonTracks commented on GitHub (Jan 13, 2025):
sure did super cheers, I now can't wait until Wednesday or Thursday to try multi-gpu's.
again, very much appreciated.
cheers.
@robbyjo commented on GitHub (Jan 13, 2025):
Thanks a lot @YonTracks and @rick-github . Really appreciate what you did. I would love to help you as much as I could. Would love to learn as well.
@rick-github commented on GitHub (Jan 13, 2025):
If you load the model only on GPU 0 and use standard 0.5.4, is the output garbled?
@YonTracks commented on GitHub (Jan 13, 2025):
There is one last thing I can try today? the current test is with the following commented out? it is supposed to be included, but good test.
@YonTracks commented on GitHub (Jan 13, 2025):
missed that, cheers yep, good idea.
edit:^ If my way of thinking is correct, we should expect to see:
this is my main priority bug?
@YonTracks commented on GitHub (Jan 13, 2025):
https://github.com/YonTracks/ollama-yontracks/releases/tag/0.5.4-test2-yontracks
@robbyjo commented on GitHub (Jan 13, 2025):
@rick-github Yes, using Ollama official 0.5.4, loading to only GPU 0 leading to Garbled output. I'll try the second test @YonTracks , be right back.
@robbyjo commented on GitHub (Jan 13, 2025):
@YonTracks I tested your test2 build. For some reason, no matter what I request (GPU 0 only or GPU 1 only), the model is always loaded on GPU 1, and that worked. I do notice that sometimes the official build somehow ignores this selection as well.
If I set CUDA_VISIBLE_DEVICES to both GPU0 and GPU1, then the output is still garbled.
@YonTracks commented on GitHub (Jan 13, 2025):
In the server.log you will see, when it is working, that the ctx shows a larger kv/ctx, for example:
llama_new_context_with_model: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be...
and when it does not work:
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be...
and (2048) is the default, too small, and also the fallback for parallel. I bet if I force a default num_ctx of (32768), or the size needed, then it will work?
server.log would be good to see.
and
time=2025-01-13T13:36:32.664+10:00 level=DEBUG source=sched.go:611 msg=optsExisting: ""="{NumCtx:2048 NumBatch:512 NumGPU:-1 MainGPU:0 LowVRAM:false F16KV:false LogitsAll:false VocabOnly:false UseMMap: UseMLock:false NumThread:0}"
time=2025-01-13T13:36:32.664+10:00 level=DEBUG source=sched.go:612 msg=ctx: ""="context.Background.WithDeadline(2025-01-13 13:36:32.763406 +1000 AEST m=+2.164479501 [99.3874ms]).WithCancel"
@rick-github commented on GitHub (Jan 13, 2025):
I'm leaning toward the idea that your GPU 0 is sub-optimal in some way. The switch from 0.3.14 to 0.5.* that saw the onset of this problem might be because the GPU kernels are executing in a different part of the GPU/VRAM, not because of any change in the code. It would explain the good/bad results for 0.4.7: the different context sizes move stuff around inside the GPU. Since nobody else has so far reported a problem, it narrows it down to your particular setup/configuration. Have you tried running a GPU/VRAM tester to see if anything gets flagged?
@robbyjo commented on GitHub (Jan 13, 2025):
For what it's worth, I tried the following in Python:
import torch
torch.cuda.device_count() # returns only 1 instead of 2
I checked GPU-Z and my GPU0 was not marked as supporting CUDA (which is weird).
Edit: I must add that nvidia-smi somehow recognizes both GPUs. Doubly weird. Why not CUDA?
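A slightly longer diagnostic along the same lines (hypothetical, just for troubleshooting) prints what each CUDA index maps to and what CUDA_VISIBLE_DEVICES is set to, which helps separate a driver/PCIe problem from an environment-variable problem:

import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("cuda available:", torch.cuda.is_available())
count = torch.cuda.device_count()
print("device count:", count)  # expected 2 for 2x4090
for i in range(count):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 2**30:.1f} GiB")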
@YonTracks commented on GitHub (Jan 13, 2025):
rick you are awesome, cheers for putting up with things, you have a good way nice!
does this mean this can be closed?
@rick-github commented on GitHub (Jan 13, 2025):
While I think this is a hardware not a software problem, we should wait to see if Roby can do a test and verify. I don't use Windows but I've heard that OCCT is good for this sort of testing.
@robbyjo commented on GitHub (Jan 14, 2025):
@rick-github and @YonTracks Ok this is weird. OCCT could detect the two cards

I'm frankly at a loss. I tried downgrading to CUDA 12.1 or 12.3, but it still doesn't work (GPU0 is still not recognized as CUDA).
@rick-github commented on GitHub (Jan 14, 2025):
Did you run a VRAM test?
@YonTracks commented on GitHub (Jan 14, 2025):
Device Manager is good also, to confirm no gpu/s issues and drivers are compatible and all that, I think in the properties of the display adapters the gpu will show if any issue, resources etc. Good luck.
funny enough, for me pci issues etc. might be a windows 11 thing, but I have an old riser and old rtx2060, so lol, my own issues. but, interesting I believe it was working. I wonder.
good luck.
@robbyjo commented on GitHub (Jan 14, 2025):
Hi @rick-github and @YonTracks, thank you so much for your help. OCCT did not detect any errors. However, Device Manager did show some warnings for a PCI Device (unsure if it's related to the display). As I mentioned earlier, GPU-Z detected both cards, but GPU0 did not have the CUDA flag checked, while GPU1 did. Which is weird.
@robbyjo commented on GitHub (Jan 14, 2025):
By the way, I have my own OpenCL program and it worked across both GPUs.
@YonTracks commented on GitHub (Jan 14, 2025):
I hope I don't hinder.
But yep, it seems like a PCI bus / resources issue (an example would be SSD M.2 slots also taking up PCI resources). On Windows 11 I'm guessing CUDA is not happy with gen 1.1 and happy with 4.0. Try to fix / delete all the yellow triangles (they should just auto-install on the next restart) and use trial and error for the devices needed; then try to force x8 for both GPUs and check the M.2 slots, keeping the first slot open if you can, or slot 2, and check the BIOS.
I will be testing similar, in the next few days hopefully, will know this afternoon anyway (PC parts).
good luck.
update^: yep, I had similar issues, I had to remove a M2 ssd and make bios changes, or my PC would not even start with 2nd gpu.
I need better motherboard, I did not end up with any parts, I tried, I need to order online, oh well.
good luck
@rick-github commented on GitHub (Jan 15, 2025):
What happens if you switch the slots the cards are plugged in to? Does GPU 1 then become the one that garbles output and has no CUDA in GPU-Z?
@robbyjo commented on GitHub (Jan 16, 2025):
Hi @YonTracks and @rick-github. At present all evidence seems to point to a hardware issue. I already updated the drivers, to no avail. PCIe still has problems. Not sure if I have to swap the GPUs; pretty sure GPU1 (which would then become GPU0) would be garbled. I'm in no position to eliminate the second SSD or to replace the motherboard at this point. Will try this some time in the future. Thanks for all your help!
@robbyjo commented on GitHub (Mar 3, 2025):
Hi @YonTracks and @rick-github. I managed to get both of my GPUs to work now (both CUDA options are on). I updated Ollama to the latest version (0.5.12). My two GPUs held some load, but the output is still garbled.
@rick-github commented on GitHub (Mar 3, 2025):
What version of windows do you use? If you are OK with running a non-standard build I can have a go at generating a binary with some changes that may or may not make a difference.
@YonTracks commented on GitHub (Mar 3, 2025):
Howdy @robbyjo, sorry this is happening.
I feel for you. I don't want to hinder; Rick is awesome and can help way better than I or explain way better but needs good info.
Does it work correctly at all?
Is this Docker (not sure if it matters), LM Studio, and/or the CLI?
Is everything for ollama default? What env params have you tried? OLLAMA_SCHED_SPREAD, OLLAMA_FLASH_ATTENTION, OLLAMA_GPU_OVERHEAD, CUDA_VISIBLE_DEVICES, etc.
Which Windows 11, and how updated?
Are the GPUs using risers or networked, etc.?
Which CUDA 12.x, and are the path env variables confirmed?
The latest server.log from 0.5.12+ would be better also.
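To answer the environment questions in one pass, a small hypothetical snippet can dump the Ollama- and CUDA-related variables from the shell that starts the server (YonTracks notes the env details also appear near the top of server.log):

import os

# List every OLLAMA_* and CUDA_* variable in the current environment;
# run this from the same session that launches `ollama serve`.
for key in sorted(os.environ):
    if key.upper().startswith(("OLLAMA_", "CUDA_")):
        print(f"{key}={os.environ[key]}")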
@YonTracks commented on GitHub (Mar 4, 2025):
OK, so a recap.
Correct me if wrong; the main issue is:
I would ensure/confirm the following system variables:
Path: CUDA 12.1 and 12.3; the logs above are showing both?
Also check CUDA_PATH_V12_3, CUDA_PATH_V12_1, or just CUDA_PATH; it most likely has both also? I think this causes issues! It did for me: 12.1 and 12.4 worked for me with older ollama, but newer ollama seems to need only one, or proper config. I think 12.8 is best, not sure. I am currently using 12.8 with the toolkit only (I uninstalled everything else and started with 12.8, so the env gets set automatically with only one),
but I can't yet test multi-GPU, sorry, I got sidetracked.
@YonTracks commented on GitHub (Mar 4, 2025):
And remove all ollama env vars to try to use the defaults, adding them back as needed, confirming things as you go.
@robbyjo, below are the variables from above; possible issues. Bugger! This was in the original message; I missed it, so sorry.
OLLAMA_GPU_OVERHEAD: 1572864000. Remember to start fresh; you should see OLLAMA_GPU_OVERHEAD: 0 <<< this should be most of the issue. Check.
ollama should have a somewhat working default.
good luck.
@robbyjo commented on GitHub (Dec 10, 2025):
The issue was that my second GPU was faulty and I RMA-ed it. Apologies for all the trouble.