Mirror of https://github.com/ollama/ollama.git (synced 2026-05-07 00:22:43 -05:00)
[GH-ISSUE #12010] Feature: (Re)introduce functionality for manually overriding layer splitting and GPU offload decisions #70034
Closed · opened 2026-05-04 20:06:34 -05:00 by GiteaMirror · 22 comments
Originally created by @gordan-bobic on GitHub (Aug 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12010
Removal of `--tensor-split` in 0.11.5 is a MASSIVE leap backward. The allocation split calibration heuristic is beyond terrible. With multiple GPUs and a large model, the heuristic split results in a full-GPU-offload context size as low as 50% of what can otherwise be achieved.
Example:
Model: llama3.2-vision:90b, 101 layers
VRAM: 4x 22GB
layer-split (with 0.11.4): 26,25,25,25
Maximum num_ctx length before 1 layer is moved to CPU: 14094
This is vastly sub-optimal. Overriding the layer split to `24,27,27,23`, we can achieve a fully populated context with num_ctx ~29700, without CPU offload and without OOM. This is not a small difference; it is more than 2x, and with the default split 15-16GB of VRAM remains unused.
Without the layer split being passed, this is difficult to override.
If you are going to insist on removing the `--tensor-split` parameter, then at the very least make it configurable in some way.
The difference between 14,000 and 29,000 tokens of usable context on the same hardware is not small; it is the difference between unusable for most tasks and comfortably usable for most tasks.
@rick-github commented on GitHub (Aug 21, 2025):
0.11.5 changes how a model is scheduled across multiple GPUs with a view to decreasing memory overhead and power consumption. Server logs may aid in debugging.
@gordan-bobic commented on GitHub (Aug 21, 2025):
The log shows it splitting exactly the same way as 0.11.4 for this model: 26,25,25,25. 0.11.5 behaves no differently in terms of split in this case, except it takes away my ability to override it using an injected redneck script that gets it to optimal settings.
Is there some other way/place to override it?
@rick-github commented on GitHub (Aug 21, 2025):
Unfortunately not currently.
The preferred solution would be to have ollama find the optimal layer allocation so that redneck scripts are not required. Server logs (preferably with `OLLAMA_DEBUG=2` to see layer assignments; note this would also include prompts that might need redacting) would help in debugging.
@gordan-bobic commented on GitHub (Aug 21, 2025):
I have seen years of pain arise from such assumptions, only for the problem eventually to be fixed by capitulating and providing a proper override method.
Here is what it looks like when left to its own devices. No prompt needed; just set num_ctx to more than about 14100.
But force it with an override to offload all 101 layers and set the split to 24,27,27,23, and it will consistently handle up to a little over num_ctx=29000, even if you actually fill it up with 29000 tokens of lorem ipsum (just ask the model to generate as much of it as possible repeatedly, if you don't want to paste in a few large chunks of it). Without any context payload it will allocate up to about 33000 num_ctx in VRAM without OOM, but as the context fills up it needs enough extra memory to make CUDA OOM. 29000 (actually up to about 29700) works even with the context full to the brim.
I'm probably going to look into getting a feature implemented at some point for a config file with overrides. If/when that happens, I'm happy to submit a PR if there is interest in incorporating it. Expecting the heuristic to always get it right, so that a user override mechanism is never needed, is IMHO a recipe for long-term disappointment.
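[Editor's note] For concreteness, a minimal sketch of the probe described above, assuming a local server on the default port. `/api/generate` and the `num_ctx` option are part of the public Ollama API; the model name and context value are the ones quoted in this thread:

```bash
#!/usr/bin/env bash
# Fill the context with filler text at a fixed num_ctx and watch the
# server log for CUDA OOM. Requires curl and jq.
FILLER=$(yes "lorem ipsum dolor sit amet" | head -n 4000 | tr '\n' ' ')
jq -n --arg p "$FILLER" \
  '{model: "llama3.2-vision:90b", prompt: $p, stream: false,
    options: {num_ctx: 29696}}' \
  | curl -s http://localhost:11434/api/generate -d @- \
  | jq '{done, prompt_eval_count, eval_count}'
```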
@jessegross commented on GitHub (Aug 21, 2025):
I would recommend setting OLLAMA_NEW_ESTIMATES=1. The changes that you are seeing are the result of an overhaul of the memory allocation code, which significantly improves layouts, especially on multi-GPU systems. If it doesn't work well, please post the logs here. It will become the default behavior in the near future.
The communication between the Ollama server and runner is an internal API. It's not something that we can guarantee compatibility on between versions. And to be completely honest and to save you some disappointment, the PR that you are describing with a config file is not likely to be accepted.
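[Editor's note] A quick way to try this, as a sketch (the systemd drop-in mechanism is standard; the variable names are the ones suggested in this thread):

```bash
# Foreground, for a quick test:
OLLAMA_NEW_ESTIMATES=1 OLLAMA_DEBUG=2 ollama serve

# Or, if ollama runs as a systemd service, add a drop-in:
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_NEW_ESTIMATES=1"
#   Environment="OLLAMA_DEBUG=2"
sudo systemctl restart ollama
```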
@gordan-bobic commented on GitHub (Aug 21, 2025):
With `OLLAMA_NEW_ESTIMATES=1` it doesn't actually seem to report how it split the layers, but I can confirm that it offloaded all 101 layers to the GPU. Looking at the memory usage on the GPUs, it is not using the optimal split, because what works with my explicit `24,27,27,23` split without OOM-ing actually OOMs with `OLLAMA_NEW_ESTIMATES=1`.
That means that `OLLAMA_NEW_ESTIMATES=1` introduces instability due to OOM, in addition to choosing a sub-optimal layer split. If I had to guess from the memory usage, it decided on 25,27,27,22, which resulted in OOM on device 0 because device 0 in this case also runs the Xorg console, which seems to eat an extra GB of VRAM on that GPU, and ollama probably doesn't account for that.
@jessegross commented on GitHub (Aug 21, 2025):
Please post the logs, ideally with OLLAMA_DEBUG=1
@gordan-bobic commented on GitHub (Aug 21, 2025):
Here is the OLLAMA_DEBUG=2 log with OLLAMA_NEW_ESTIMATES=1 set and the resulting OOM. When it OOM-ed on GPU 0, there was memory to spare on the other GPUs.
Input is 166 paragraphs of lorem ipsum generated using this: https://www.lipsum.com/, which totals a little over 30,000 tokens with `llama3.2-vision:90b`.
OLLAMA_NEW_ESTIMATES.log.gz
@gordan-bobic commented on GitHub (Aug 21, 2025):
For reference, here is the log from my custom optimised layer split override that works fine without OOM-ing on 0.11.4. I use a wrapper script to override the `--n-gpu-layers` and `--tensor-split` parameters when `ollama runner` is invoked (and that no longer works with 0.11.5); a sketch of the approach follows below.
ollama-custom.log.gz
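[Editor's note] For readers unfamiliar with the trick being described: a hypothetical sketch of such a wrapper, assuming the real binary has been moved aside to `ollama.real` and this script installed in its place. The flag names and split values are the ones quoted in this thread; the runner CLI is internal, which is exactly why the approach broke in 0.11.5:

```bash
#!/usr/bin/env bash
# Stand-in for the ollama binary: rewrite the internal runner flags on
# the way through, then exec the real binary with everything else intact.
REAL=/usr/local/bin/ollama.real
args=()
while (($#)); do
  case "$1" in
    --tensor-split)  args+=(--tensor-split 24,27,27,23); shift 2 || break ;;
    --n-gpu-layers)  args+=(--n-gpu-layers 101);         shift 2 || break ;;
    *)               args+=("$1");                       shift ;;
  esac
done
exec "$REAL" "${args[@]}"
```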
@jessegross commented on GitHub (Aug 21, 2025):
Thank you for the logs.
OLLAMA_NEW_ESTIMATES logs the layer allocations in a different format but if we map them back to the traditional way, I see that it is calculating 21,27,26,27. As you point out, X is using extra memory but that is being taken into account. In fact, it is actually more conservative in this regard compared to your manual settings while still offloading all of the layers.
Note that in your custom version, the context length is 4096 whereas on the new estimates version it is 29696. My guess is that if you used the same context length on your version and also filled it with text, you would see the same OOM.
This OOM has become a more prominent issue as we have fixed many of the existing issues using the new memory estimates (see #11753). As a workaround, you should be able to avoid it by setting OLLAMA_FLASH_ATTENTION=1 in addition to OLLAMA_NEW_ESTIMATES=1.
If you are curious, you can see the layout used by the new estimates in the following log line:
Aug 21 22:14:21 overmind ollama[203987]: time=2025-08-21T22:14:21.430+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:29696 KvCacheType: NumThreads:24 GPULayers:101[ID:GPU-ea5a8e95-7eb3-0b4b-e92c-688d62f8fe3f Layers:26(0..25) ID:GPU-87e6e1e0-407a-eb0f-ea9e-fd078f436174 Layers:27(26..52) ID:GPU-556dc4f5-3b5e-b765-dbed-6e3c2836ecc7 Layers:27(53..79) ID:GPU-5e21c94b-9945-57c0-78a9-c884f4d29e1a Layers:21(80..100)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
You can see the layer counts in addition to the GPU IDs. They may not be in the same order as you previously saw, but you can reorder them by matching the IDs to these log lines:
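[Editor's note] To match the `GPU-…` UUIDs in that log line back to device indices, standard nvidia-smi query flags are enough (not ollama-specific):

```bash
# Print index, UUID and name for each GPU so the IDs in the runner log
# can be mapped to CUDA0..CUDA3.
nvidia-smi --query-gpu=index,uuid,name --format=csv
```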
@gordan-bobic commented on GitHub (Aug 21, 2025):
That was on the 2nd iteration; if you look at the first response in the log, it actually used the 29696 context length. The reason one part has a shorter, 4096-token context is to do with how Open-WebUI works: it creates short labels for chats based on the first 4K tokens, but the real response uses the full context. I tested it multiple times, and with my custom split it definitely uses the full ~29K context size, and it definitely doesn't OOM on ollama 0.11.4 where I can apply my override.
Thanks for explaining how the layer split is described in the new log, that is very helpful. I don't know, then, why my split doesn't OOM in 0.11.4. It could be that there is another change in 0.11.5 that is causing the OOM. I would expect the OOM with the OLLAMA_NEW_ESTIMATES split to occur on GPU 3 instead.
I can confirm that OLLAMA_FLASH_ATTENTION does seem to reduce GPU memory usage slightly, but this seems to be happening because the context is reduced to a little over 27,000 tokens, no matter how long the chat history is and how high I set `--ctx-size`.
@jessegross commented on GitHub (Aug 21, 2025):
I see the first iteration in your logs with the larger context now.
One more piece of the layer allocations that I didn't mention is that the new estimates also explicitly specify which layers to offload on which GPU so it is no longer a simple in-order mapping. As a result, even though there are fewer layers on GPU 0, some of the layers are larger. Adding up all of the allocations on your version, I see 18G, whereas with the new estimates it is 18.4G. Probably the old default ordering enables a slightly more even packing than the new one but that's mostly just luck as it is very close.
Flash attention should not reduce the maximum context length, where are you seeing that? Does it prevent the crashes when used with the new estimates? It should reduce memory usage by avoiding some of the allocations for intermediate states.
@gordan-bobic commented on GitHub (Aug 21, 2025):
The logs don't seem to show any out-of-order mapping, though.
Unless I am misunderstanding what you meant by that. I would also expect that multiple passes back and forth between GPUs in a single cycle would be less efficient than sequential layer splitting.
That's what I thought, but regardless of the amount of data in the context and the set ctx-size, the prompt_tokens/total_tokens reported by Open-WebUI never exceeded about 27K, and I can see from the ollama runner parameters (in 0.11.4) that ctx-size was set to 37K at the time. If I disable flash attention, it goes all the way up to the calibrated maximum I can achieve without OOM on 0.11.4, about 29696.
But debugging OLLAMA_FLASH_ATTENTION behaviour is probably worthy of a separate ticket.
@jessegross commented on GitHub (Aug 22, 2025):
Mapping the IDs to names you will see this is CUDA2, CUDA1, CUDA3, CUDA0. Previously it was always CUDA0, CUDA1, CUDA2, CUDA3.
It's still sequential, just a different ordering. We don't consider PCIe topology and maybe the original enumeration order will group things better, though it probably doesn't make a difference for most consumer PCs. The new ordering is mostly an artifact of the allocation system. Using the original order would help your case but that's likely just luck.
@gordan-bobic commented on GitHub (Aug 22, 2025):
Ah, I understand what you were referring to now.
The original point remains, though: there is definitely value in being able to override the `--n-gpu-layers` and `--tensor-split` parameters, because heuristics are never infallible.
@morgwai commented on GitHub (Sep 1, 2025):
I have an asymmetric GPU setup: an RTX 3090 24GB and a GTX 1080 Ti 11GB, so for models sized 24-30GB I want to put as many layers as possible on the 3090 and only the remaining ones on the 1080 Ti, for obvious performance reasons. By default ollama splits them to obtain roughly the same percentage of VRAM utilization on both cards (for example 18GB/8GB), which is waaay slower than if I specify `--n-gpu-layers` and `--tensor-split` manually as described by @gordan-bobic (thanks for your Altechnative article, man!).
Sometimes I need even more elaborate tuning; for example, I need a specific amount of VRAM left free on a specific card to run some other stuff there.
Several people have asked for similar configuration options (for example #10172) or described how tuning these pre-0.11.5 params can improve performance (and I could keep providing more and more links, of course...), but the ollama team keeps insisting that they know better how people should run their workloads... Not everyone is Steve Jobs to get away with such an attitude: people will just migrate to llama.cpp, vLLM, ExLlamaV2, or whichever engine gives the most flexibility (I'm just investigating this myself ATM).
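[Editor's note] For comparison, llama.cpp's server does expose these knobs directly; a hedged example with a hypothetical model path and illustrative values (the flags are current `llama-server` options, and `--tensor-split` takes per-device proportions, here weighted toward the larger card):

```bash
# Offload up to 99 layers with an explicit per-device split and context size.
llama-server -m ./model.gguf \
  --n-gpu-layers 99 \
  --tensor-split 24,11 \
  --ctx-size 29696
```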
@gordan-bobic commented on GitHub (Sep 26, 2025):
The latest version (0.12.2) is spectacularly, hilariously bad at figuring out the split.
With a manual split on 0.11.4 I run hermes4:70b with full 128K context on 4x22GB GPUs with full 80/80 layer GPU offload.
With 0.12.2 it offloads only 18/80 layers to the GPU which makes the whole system completely unusable.
Removing options for manually overriding auto-detected settings is NEVER a good idea.
@rick-github commented on GitHub (Sep 26, 2025):
Server logs may aid in debugging.
@gordan-bobic commented on GitHub (Sep 26, 2025):
This ticket isn't about debugging layer-splitting heuristics; it is about (re)adding a feature to facilitate manual override.
@jessegross commented on GitHub (Sep 26, 2025):
There was never functionality to allow manual control of the layer assignments; the interface being manipulated by the scripts described here is internal and not publicly exposed.
Offloading will only get more complicated over time as we optimize memory usage, and we don't want an ever-expanding API, so we don't plan to add more controls than we currently have.
@gordan-bobic commented on GitHub (Sep 26, 2025):
As you have been optimizing memory usage, things seem to have been getting worse rather than better. I'll get a feature for this implemented and pull requested.
@chrisoutwright commented on GitHub (Sep 28, 2025):
I also get completely uneven splits with same-VRAM GPUs: llama-3.3-nemotron-super-v1.5-q4km:49b
https://github.com/ollama/ollama/issues/7047#issuecomment-3342197917
using: (new estimates or SchedSpread changes do not help)