Mirror of https://github.com/ollama/ollama.git (synced 2026-05-05 23:53:43 -05:00)
[GH-ISSUE #1800] OOM errors for large context models can be solved by reducing 'num_batch' down from the default of 512 #26787
Closed · opened 2026-04-22 03:23:27 -05:00 by GiteaMirror · 11 comments
Originally created by @jukofyork on GitHub (Jan 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1800
Originally assigned to: @BruceMacD on GitHub.
I thought I'd post this here in case it helps others suffering from OOM errors, as I searched and could find no mention of either "num_batch" or "n_batch" anywhere here.
I've been having endless problems with OOM errors when I try to run models with a context length of 16k like "deepseek-coder:33b-instruct" and originally thought it was due to this:
But whatever I set that to (even tiny fractions like 1 / 100), I would still eventually get an OOM error after inputting a lot of data to the 16k models... I could actually see the VRAM use go up using nvidia-smi in Linux until it hit the 24GB of my 4090 and then crash.
So next I tried "num_gpu=0" and this did work (I still got the benefit of cuBLAS for the prompt evaluation, but otherwise very slow generation...). As soon as I set this to even "num_gpu=1", I would get an OOM error after inputting a lot of data (but still way less than 16k tokens) to the 16k models.
So I then went into the Ollama source and found there are some hidden "PARAMETER" settings not mentioned in "/docs/modelfile.md" that can be found in "api/types.go", and one of these is "num_batch" (which corresponds to "n_batch" in llama.cpp). It turns out this was the solution. The default value is 512 (which is inherited from llama.cpp) and I found that reducing it finally solved the OOM crash problem.
It looks like there may even be a relationship where it needs to be decreased by a factor of num_ctx/4096 (= 4 for the 16k context models), and this in turn could possibly have something to do with the 3 / 4 magic number in the code above and/or the fact that 4096 is a very common default context size? Anyway, setting it to 128 almost worked, unless I deliberately fed in a file I created that I know deepseek-coder:33b-instruct will tokenize into 16216 tokens... So I then reduced it to 64 and have since fed this same file in 4-5 times using the chat completion API, so the complete conversation is > 64k tokens, and it still hasn't crashed yet (the poor thing had a meltdown after 64k tokens and just replied "I'm sorry, but I can't assist with that" though lol).
I suspect I could get even closer to 128 as it did almost work but atm I'm just leaving it at 64 to see how I get on...
It should be noted that num_batch has to be >=32 (as per the llama.cpp docs) or otherwise it won't use the cuBLAS kernels for prompt evaluations at all.
I suggest anybody suffering from similar OOM errors add this to their modelfiles, starting at 32:

PARAMETER num_batch 32

and keep doubling it until you get the OOM errors again.
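For example, a minimal Modelfile along these lines (the FROM model and num_ctx value here are just illustrative placeholders for whatever model you are tuning):

FROM deepseek-coder:33b-instruct
# illustrative: a 16k context, where the default num_batch of 512 was OOMing
PARAMETER num_ctx 16384
# start at the minimum useful value and double (32 -> 64 -> 128 ...) until OOM returns
PARAMETER num_batch 32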
@jukofyork commented on GitHub (Jan 5, 2024):
Just a quick update on other models that have different architectures.
Again I'm using my test file of ~16k tokens and a setting of num_batch=64, on Debian 12 with 64GB RAM + a 4090 with 24GB VRAM:

codellama:34b-instruct with 16k context - passed.
yi:34b-chat with 16k context - passed.
mixtral:8x7b-instruct-v0.1 with 32k context, fed the file 2x - passed.

I will try deepseek-llm:67b-chat with its context extended to 16k tomorrow and report back. I don't have any other base models I can test on, but I'm pretty sure I've solved my OOM problems now. nvidia-smi is showing around 21-23GB used of the 24GB at all times, and it seems that I can now repeatedly fill the context until my LLMs have a meltdown 🤣

@mongolu commented on GitHub (Jan 5, 2024):
Niceee!
Thanks, it resolved my problem (I was bumping into this too, often).
I use 64 for num_batch now.
@jukofyork commented on GitHub (Jan 5, 2024):
Can you run a test and see if leaving it at 512 and setting num_gpu=1 still crashes for you? I'm beginning to suspect this is a problem with the wrapped llama.cpp server rather than Ollama itself...

If anybody else is getting these crashes and reducing the batch size fixes it: can you also run a test with num_gpu=1 and see if it still crashes with the default batch size of 512? I'll make a detailed post on their GitHub if we can narrow it down a bit more.

I've got to go out, but I think we can also refine the * 3 / 4 magic number and possibly use more of the GPU now: somewhere I have bookmarked the formula used to calculate the KV working memory (and I tested to make sure it agrees with llama.cpp main's output). In theory we should be able to use this instead of the magic number, but to do so will require exposing some more of the fields read from the GGUF file to Gpu.go to calculate it. I'm also not sure just how much, if any, of the GPU VRAM is used for the cuBLAS batching and need to benchmark it.

@jukofyork commented on GitHub (Jan 5, 2024):
I can confirm this page has the correct formula for calculating the KV cache:
https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
KV cache size = batch_size * seqlen * (d_model/n_heads) * n_layers * 2 * 2 * n_kv_heads

I did this calculation by hand for (IIRC) Llama-70b and a context length of 2048:
batch_size = 1 (NOTE: this is a different batch size to what we have here and is about serving multiple users from a A100)
d_model = 8192
n_heads = 64
n_layers = 80
n_kv_heads = 8
(1 * 2048 * (8192/64) * 80 * 2 * 2 * 8) / 1024^2 = 640MB
and this is exactly the same as what llama.cpp::main prints towards the bottom of its output when run.
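As a sanity check, the same arithmetic as a small standalone Go program (a sketch for verifying the formula by hand, not code from the ollama tree):

package main

import "fmt"

// kvCacheSizeMB evaluates the formula quoted above:
// batch_size * seqlen * (d_model/n_heads) * n_layers * 2 * 2 * n_kv_heads
// The two trailing factors of 2 are the K and V tensors and 2 bytes per fp16 value.
func kvCacheSizeMB(batchSize, seqLen, dModel, nHeads, nLayers, nKVHeads int64) int64 {
	bytes := batchSize * seqLen * (dModel / nHeads) * nLayers * 2 * 2 * nKVHeads
	return bytes / (1024 * 1024)
}

func main() {
	// Llama-70b at 2048 context: prints 640, matching llama.cpp::main.
	fmt.Println(kvCacheSizeMB(1, 2048, 8192, 64, 80, 8))
}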
There are several wrong formulas floating about too:
https://old.reddit.com/r/LocalLLaMA/comments/1848puo/relationship_of_ram_to_context_size/
https://old.reddit.com/r/LocalLLaMA/comments/15825bt/how_much_ram_is_needed_for_llama2_70b_32k_context/
https://www.baseten.co/blog/llm-transformer-inference-guide/
Currently the function at the bottom of Gpu.go only gets passed the size of the model and the n_layers value, but I assume it wouldn't be hard to change it to pass the other values from the GGUF file's header to it and do the proper calculation? IIRC, when I looked at the output of llama.cpp::main, some things like d_model were named differently to the formula above though.

This is from the new wizardcoder:33b-v1.1 model (a fine-tune of deepseek-coder:33b-instruct) which I just had the GGUF file handy for looking at:
d_model <--> n_embd = 7168 (which I think is also: n_head * n_rot)
n_heads <--> n_head = 56
n_layers <--> n_layer = 64
n_kv_heads <--> n_head_kv = 8
So redoing the calculation for a 16k context size:

(1 * 16384 * (7168/56) * 64 * 2 * 2 * 8) / 1024^2 = 4096MB
Using the 3/4 magic number from Gpu.go on my 4090:
24GB VRAM = 24×1024 = 24576MB
24576 * 3/4 = 18432MB
18432 + 4096 = 22528MB
When running the model on my ~16k token file (with num_batch=64), nvidia-smi is showing the same use the whole time (ie: for both prompt evaluation and for generation):
21892MB / 24564MB
and this ties in with the above, as the integer division in the gpu::NumGPU() function will be rounding down the number of layers.

I don't really know enough about cuBLAS to know if it needs any VRAM to run the prompt evaluation, though from this it doesn't look like it does (?).
These are the nvidia-smi stats for 8192 and 4096 context sizes for reference:
Using 8192 context size: 20480MB / 24564MB
Using 4096 context size: 20238MB / 24564MB
Which should have a KV cache size of 2048MB and 1024MB respectively, yet the Gpu.go function will just be allocating 3/4 of the 24GB for the offloaded layers and the extra VRAM must be getting used by cuBLAS (?).
So it's not 100% clear what's going on and it's probably worthwhile doing some benchmarks to see how to incorporate the KV cache size formula properly for those of us running with much smaller or much larger context sizes to utilize our VRAM as best as possible.
Anyway, hope this is useful for somebody to work on refining the gpu::NumGPU() calculation eventually.

I just tested a 32k context model and, right enough, it did crash with this error:

Error: Post "http://127.0.0.1:11434/api/generate": EOF

So quite clearly gpu::NumGPU() should be dynamically calculating the layers better, and the 3/4 magic number is only working through luck most of the time (and possibly wasting VRAM for those running with < 4096 context too...).

So, looking at the code to see how hard it would be to change: llm::New() has access to ggml, which contains the required variables, and the chain goes down from there. gpu::NumGPU() would also need to be passed the context length, but ext_server::newExtServer() gets this a couple of lines down anyway.

In gpu::NumGPU() you would need to use the formula above (possibly with some extra subtracted for cuBLAS, as mentioned). I'd do a pull request, but I know nothing about Go and it would probably be a bodge-job considering how many different variables need passing up the chain... I think the best solution might be to calculate the KV cache size for a context length of 1, pass this up the chain to ext_server::newExtServer(), multiply it by the sparams.n_ctx value, and then pass this as an extra parameter to gpu::NumGPU() to use. Hopefully somebody can try this; if not I'll have a go, but I would be much happier if somebody familiar with the codebase and Go did it.
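A rough sketch of that proposal in Go (illustrative only - the names, signatures, and the model size used in main are assumptions, not ollama's actual code):

package main

import "fmt"

// numGPULayers sketches replacing the fixed 3/4 VRAM fraction with a
// budget that reserves the KV cache up front, as proposed above.
// kvBytesPerCtxToken would come from the GGUF header: for wizardcoder
// above, (7168/56) * 64 * 2 * 2 * 8 = 262144 bytes (256KB) per token,
// so a 16k context reserves exactly the 4096MB computed earlier.
func numGPULayers(freeVRAM, modelBytes int64, nLayers int, kvBytesPerCtxToken, nCtx int64) int {
	budget := freeVRAM - kvBytesPerCtxToken*nCtx // VRAM left for offloaded layers
	if budget <= 0 {
		return 0
	}
	layers := int(budget / (modelBytes / int64(nLayers))) // rounds down, like gpu::NumGPU()
	if layers > nLayers {
		layers = nLayers
	}
	return layers
}

func main() {
	// 24GB card, 16k context, 256KB of KV per token; the 22GB model size
	// is an assumed figure purely to show the layer count being clamped.
	fmt.Println(numGPULayers(24*1024*1024*1024, 22<<30, 64, 262144, 16384))
}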
@jukofyork commented on GitHub (Jan 5, 2024):

Back to the original problem... I've found a good way to find the optimal value of num_batch:

Set num_gpu manually to something fairly conservative, so it's using around 1/2 to 3/4 of your GPU's VRAM. Run nvidia-smi and watch the VRAM usage. The VRAM usage should go up rapidly at the start and then stabilize all the way through processing the huge file.

Write down the VRAM usage from nvidia-smi when it settles, and then wait until it either crashes OOM or the prompt evaluation stage is over and it starts outputting text (likely to be gibberish, or it might just end without saying anything, because you've overloaded the context...). If you have set num_batch too high, then the VRAM usage will have gone up by now (assuming it hasn't crashed OOM already).
[64, 128] --> (64+128)/2 = 96 [BAD]
[64,96] --> (64+96)/2) = 80 [GOOD]
[80,96] --> (80+96)/2 = 88 ...
and so on.
Eventually you will find the sweet spot where you can't raise it anymore without VRAM starting to leak.
Then leave
num_batch fixed at the good value and start raising num_gpu until you get OOM errors (this should happen as soon as the model loads now). You should then have optimal num_batch and num_gpu settings for that particular model and any fine-tunes of it.

I've just done this with deepseek-coder:33b-instruct and got num_batch = 86 and num_gpu = 52 🤣
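The bisection being described is just a binary search; as a sketch in Go (worksAtBatchSize stands in for the manual test above: load the model with that num_batch, feed it the huge file, and watch nvidia-smi for leaking VRAM or an OOM crash):

package main

import "fmt"

// findMaxNumBatch bisects between a known-good and known-bad num_batch,
// exactly as in the [64, 128] example above.
func findMaxNumBatch(good, bad int, worksAtBatchSize func(int) bool) int {
	for bad-good > 1 {
		mid := (good + bad) / 2
		if worksAtBatchSize(mid) {
			good = mid // VRAM stayed flat: search the upper half
		} else {
			bad = mid // VRAM leaked or OOM: search the lower half
		}
	}
	return good
}

func main() {
	// Stand-in check: pretend anything up to 86 works, as with
	// deepseek-coder:33b-instruct above. Prints 86.
	fmt.Println(findMaxNumBatch(32, 512, func(b int) bool { return b <= 86 }))
}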
It will be interesting to see if num_batch = 86 is constant for other base models like Llama 2 or Yi.

You might also want to kill the ollama process between each test, as it's sometimes not clear if it has actually reloaded the new value, and/or sometimes it seems to go into a CPU-only mode where it doesn't use cuBLAS at all (ie: GPU use stays at 0% in nvidia-smi and it takes an extremely long time to run the prompt evaluation stage).

@mongolu commented on GitHub (Jan 6, 2024):
Before putting num_batch=64 I didn't have this param in the modelfile, but I had tried with num_gpu=1 and it still crashed.
Pretty impressive work you've done.
I'm sorry, I don't quite follow you; maybe others more experienced will.
Right now, I'm happy that it works without crashing, so far.
@jukofyork commented on GitHub (Jan 6, 2024):
I've managed to tune for the deepseek-coder, codellama, and yi base models now, and it seems really random, with optimal values at a 16k context length ranging from 80 to 180.
It does seem that fine-tuned versions have almost the same optimal value, but not necessarily exactly the same, so I've chosen to round down to the previous multiple of 16 for safety.
I can run nearly anything with a context length of 4096 and the default batch size of 512, apart from Mixtral, which needs 256.
Mixtral still leaks memory and crashes with a 32k context length on the lowest allowable batch size of 32 if I give it a really massive file.
I'm going to retry with Q8 and Q6_K models later and see if they are any different to the current Q5_K_M models - there is some chance these use a different code path in llama.cpp and might avoid whatever is leaking VRAM.
@jukofyork commented on GitHub (Jan 6, 2024):
Yeah, I was having to use num_gpu=0 and had really slow generation (but still fast prompt evaluation from using cuBLAS). I'm getting a lot more usable generation now but the prompt evaluation is slower than it was...
Until this gets fixed I'm going to have 2 copies of each model: a 4k context with 512 batch size and a 16k context with the maximum non-OOM batch size, and choose between them based on the task (4k for small discussion prompts and 16k for large source-code ingestion prompts).
@jukofyork commented on GitHub (Jan 6, 2024):
Update:

Tried deepseek-coder:33b-instruct-Q8_0 and same problem...

@jukofyork commented on GitHub (Jan 8, 2024):
Update: I've just moved to not using the lower K-quant models if I want > 4k context. This buffer leak seems to only happen when increasing the context; I can still run 4k context models fine using a mix of CPU and GPU.
@jmorganca commented on GitHub (Mar 12, 2024):
Hi folks, if it's okay I'm going to merge this with the ongoing OOM + batch size issue: #1952