Originally created by @abes200 on GitHub (Aug 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6148
What is the issue?
I did see an issue where this was mentioned, but it was closed and marked as fixed in version 0.2.1.
I wasn't having this issue when I was using 0.3.0. I missed a few updates, but after updating to the most recent version, if I have OLLAMA_NUM_PARALLEL in my system variables or pass it as an option from Python using Ollama, the model reloads for every request that is sent.
Just to clarify, using the Ollama CLI on Windows:
`ollama run gemma2`
send a message
send another message
If I remove OLLAMA_NUM_PARALLEL from the system variables, the model loads and responds as normal.
Models seem to take a little more memory to load when I have OLLAMA_NUM_PARALLEL in my system variables than without it.
However, whether I have it set or not, I am no longer able to make parallel requests to models. Without it, requests are now always queued; with it, they are still queued, but the model unloads and reloads before each response.
Have I missed something obvious somewhere? That does happen a lot.
OS
Windows
GPU
Nvidia
CPU
AMD
Ollama version
0.3.3
@rick-github commented on GitHub (Aug 3, 2024):
Server logs may help in debugging. I just tried it and there was no model re-load between messages.
Parallel requests were also handled correctly (`OLLAMA_NUM_PARALLEL:2`):
@abes200 commented on GitHub (Aug 3, 2024):
I've run it with debug enabled and compared a log between parallel=1 and 2. Everything is basically the same up to this point:
parallel = 2
parallel = 1
For some reason, whenever parallel is set to anything higher than 1, it now immediately clears the model to make room. I would include the entire log, but as it contains a lot of information about my personal computer, I'd rather not share it publicly. I had to remove my username and GMT from the small bits above.
So this was working until a few days ago when I updated to the latest version. Now it just immediately clears the model to make room for a new one on every request. Very frustrating, as a Python program I was working on requires multi-threading to work correctly, and now I have the model reloading 3 times on every message instead.
If however I use a tiny model that I can load multiple copies of, it works just fine. It's like it now checks whether there is enough memory to load a second model, and if there is, it just goes ahead with the request. It doesn't load a second model or make any other changes to memory. But if there isn't enough memory left to load a second model, it clears the first and loads a new one before it responds.
Just to be clear, the model has finished loading and I still have a few GB of memory left to spare, but the model unloads and reloads every time before it then responds.
I was hoping there was something obvious I have missed or something that changed how I should be using OLLAMA_NUM_PARALLEL in some way.
Also, I did notice someone else mentioned they had this issue on the closed issue about 3 weeks ago, which would be a little after I last updated. LINK
If anyone figures out what stupid thing I have done this time, please let me know. Thanks!
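For reference, the kind of multi-threaded workload described in this comment might look like the sketch below, using the plain HTTP API. The model name, prompts, and worker count are illustrative placeholders, not values from the reporter's actual program.

```python
import concurrent.futures

import requests  # third-party: pip install requests

URL = "http://localhost:11434/api/generate"

def ask(prompt: str) -> str:
    # stream=False returns a single JSON object instead of a stream of chunks
    r = requests.post(
        URL,
        json={"model": "gemma2", "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"]

# With OLLAMA_NUM_PARALLEL=2 both prompts should be answered by the one
# loaded model; the bug described in this issue is that the model reloads
# before each response instead.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    for answer in pool.map(ask, ["Hello", "Tell me a joke"]):
        print(answer[:80])
```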
@rick-github commented on GitHub (Aug 3, 2024):
Server logs really would make this easier. At the very least, `egrep " POST|ctx"` from the logs.
@abes200 commented on GitHub (Aug 3, 2024):
egrep " POST|ctx"
I do not know what that is sorry.
@rick-github commented on GitHub (Aug 3, 2024):
findstr /c:" POST|ctx" logfile@abes200 commented on GitHub (Aug 3, 2024):
Right... so I am correct in assuming you are looking for a string " POST|ctx" in the log file?
It does not exist.
findstr /c:" POST|ctx" server.logis what I am assuming you mean... searching the log file for that string? That command results in literally nothing happening.
I also searched for "POST" in a server log that reloaded; here are the lines where the word POST is mentioned:
It really would be easier if I could just share the whole log. Perhaps it could be changed so that when it creates the log it doesn't include things like my Windows username, GPU ID, complete path lists, and my timezone. I'm not sure if there are other things specific to my computer/data/location included in the logs which could be considered personal information.
There are these lines which show an error, although I am not sure what it means. There do not seem to be any other errors occurring in the logs.
Those lines are also in the logs where the models do not reload, when I have OLLAMA_NUM_PARALLEL = 1.
@rick-github commented on GitHub (Aug 3, 2024):
"POST|ctx"is an OR operation. I'm looking for lines that contain " POST" and "ctx" and the order in which they occur. Looking for POST with leading space will limit the potential PII. I haven't used windows for a long time so I'm sorry the command wasn't quite right, doesfindstr /c:" POST" /c:ctxgenerate output?@abes200 commented on GitHub (Aug 4, 2024):
Ah, I see what you're after. God, I wish I could say I haven't used Windows in a long time...
`findstr /c:" POST" /c:"ctx" <logfile>` works in getting what you're after... I hope. To help ensure consistency, I performed two new tests identical to each other, except that I changed OLLAMA_NUM_PARALLEL from 2 to 1 between the first and second test.
Started with the Ollama app closed, loaded it from the Start menu, ran `ollama run gemma2`, waited for the model to load, said "Hello" and waited for a response, then `/bye` and closed the Ollama app. Then I ran `findstr /c:" POST" /c:"ctx" server.log`. Changed parallel from 2 to 1 and repeated.
These are the results from the log file with parallel at 2, which is the one that reloads the model:
These are the results from the log file with parallel set to 1:
I did not make any changes to context sizes or settings of any kind except to the parallel variable in the system variables, and I definitely said "Hello" to the model only once on both occasions and nothing else.
@rick-github commented on GitHub (Aug 4, 2024):
`--n-gpu-layers` changed between runs for OLLAMA_NUM_PARALLEL=2. What does `findstr /c:layers.model` show? And what's the output of `nvidia-smi`?
@rick-github commented on GitHub (Aug 4, 2024):
In the meantime, you may be able to stop the model reloading behaviour by explicitly setting the number of layers to be offloaded; see https://github.com/ollama/ollama/issues/5913#issuecomment-2248262520.
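For illustration, pinning the offloaded layer count per request through the HTTP API could look like the sketch below. The layer count of 24 and the model name are placeholders, not values recommended in this thread; pick a number at or below what the server log says can be offloaded.

```python
import requests  # third-party: pip install requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma2",
        "prompt": "Hello",
        "stream": False,
        # num_gpu = number of layers to offload to the GPU; 24 is a
        # placeholder -- choose a value the log shows actually fits
        "options": {"num_gpu": 24},
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```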
@abes200 commented on GitHub (Aug 4, 2024):
layers.model for the log with parallel at 2 is:
and for parallel at 1 is:
The only reason it should be making any changes to how many layers get loaded is its own internal settings and auto-detection.
Actually, most of the outputs in nvidia-smi show N/A. I do just want to restate that this was working about a week ago, before I updated the Ollama app, and this started happening immediately after the update: I was working with a model, decided I should get the new update, installed it, and then the model started reloading for every response.
As a note, I have noticed that it does occasionally change how many layers get sent to the GPU from one load to the next. It's possible there are some background tasks in Windows which might alter it a little. I have not noticed a large difference in performance for the most part.
I attempted the solution of setting the GPU layers, but that doesn't seem to work. I have tried 24, 25, and 27 so far. I tried setting it using `/set parameter num_gpu` while the model was running, and also setting it in a modelfile and creating a new model from gemma2.
The models still keep reloading if parallel is present in the variables and set to anything other than 1 or 0.
@rick-github commented on GitHub (Aug 4, 2024):
Do you have `OLLAMA_KEEP_ALIVE` set in the server environment (`findstr /c:OLLAMA_KEEP_ALIVE server.log`), or do you have a `keep_alive` option in your API call from your Python script?
@abes200 commented on GitHub (Aug 4, 2024):
I pulled `OLLAMA_KEEP_ALIVE:10m0s` from the log file.
Also, I do have that set in the system variables, yes.
@rick-github commented on GitHub (Aug 4, 2024):
What does `findstr /c:sched.go server.log` show?
@abes200 commented on GitHub (Aug 4, 2024):
Using that on the original log file that reloads the models:
And for the log that did not reload the model:
@abes200 commented on GitHub (Aug 4, 2024):
After looking through the code on GitHub, I think I can say with some confidence: I have no idea what I am looking at. My Python skills are nowhere near good enough to work it out.
I thought maybe it was something to do with the `needsReload` function, but then it should be doing it for everyone, so it must be an edge case of some kind.
Could it have something to do with how it calculates the free VRAM on my card specifically when parallelism is occurring?
Graphics: NVIDIA GeForce GTX 1660 SUPER, compute capability 7.5, VMM: yes
CPU: AMD FX-8350 4.0 GHz
Mem: 16 GB
I question this because if I set num_gpu to 0, I can run models in parallel with no problem and get 2 responses at once, and the model does not reload unless it is supposed to. Although it's quite a bit slower.
I'll just add that currently the only solution is to disable any parallel calls; although this means everything is now queued, it does at least mean the model is not constantly reloading and queuing things anyway.
I've tried changing and adding multiple different options; `OLLAMA_NUM_PARALLEL` is the only option that causes the models to reload unless it is set to 0 or 1. If set to 2 or higher, the model will reload for every message and only ever respond to 1 message at a time.
Is there anyone else having this issue or am I completely alone? If I work out what stupid thing I have done to break it I'll post it.
@rick-github commented on GitHub (Aug 5, 2024):
You are right in that `needsReload` seems implicated here. I added some debugging over the weekend but was unable to see anything unusual; somebody who's actually experiencing the problem will have to dig around in the code. It is very curious that only a small subset of people who trigger that code are affected; so far I have no explanation.
@abes200 commented on GitHub (Aug 6, 2024):
It's a bit of a pickle. I really appreciate your attempts to find a solution. Thank you for the help.
I tried to keep looking myself, but adding debug code and testing it is a bit beyond me.
I tried a few things over the weekend as well, but nothing as interesting as you did: different combinations of settings, different models; I even updated my drivers.
All I can really confirm is that if OLLAMA_NUM_PARALLEL is 0 or 1 it works fine with no reloading, and that higher numbers also make it use more memory when it loads the model but will make it unload/load the model on every input. Nothing else seems to have any impact on this.
Except for one thing. If I load a very small model like Qwen2 0.5b, of which I could fit more than one in memory, then the model works and does parallel requests just fine: no reloading, and it doesn't load a second model. If I try Qwen2 7b, of which I can fit only one, it reloads on every request.
Also, that is specific to my GPU memory: if I can fit more than one model in GPU memory, it will not reload and works fine. If only one fits, it will reload.
Why that occurs only with parallel, I have no clue.
I did just run another test: if I expand Qwen2 0.5b's num_ctx to 65536, it takes up most of my GPU memory, so now I can only load one of them. This caused the model to start reloading on every request like the 7b model does, but again only with parallel > 1. Not sure if that's helpful, but I thought of it just before posting, so I figured I'd share the result.
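As a sketch, that large-context experiment could be reproduced over the HTTP API like this; the model tag and context size mirror the comment above, and everything else is illustrative.

```python
import requests  # third-party: pip install requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2:0.5b",
        "prompt": "Hello",
        "stream": False,
        # a 65536-token window needs a large KV cache, so only one
        # instance of even this small model fits in VRAM
        "options": {"num_ctx": 65536},
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"][:80])
```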
@MHugonKaliop commented on GitHub (Aug 6, 2024):
Hello
I have the same issue with a different configuration: Linux / Ubuntu / CUDA / Nvidia L4 with 24 GB VRAM,
ollama 0.3.3 with these models:
I can't say for sure which ollama version this issue began with, but I don't think it was before 0.3.0, because the issue is new for me.
My initial env variables were:
I first had an issue with Codestral not offloading all layers to the GPU, resulting in very slow responses.
Now my env variables are:
Codestral correctly offloads all layers to the GPU, using around 18 GB of VRAM, and the throughput is back to usual.
But calling an embedding unloads Codestral every time (it's then reloaded on the next call).
The log shows this:
And I've just witnessed something else.
Codestral is correctly loaded, and /api/generate is called regularly.
But if I send a simple embed query from the command line using the same codestral, it's offloaded!
@rick-github commented on GitHub (Aug 6, 2024):
@abes200 It really feels like a memory pressure problem. I know you tried adjusting `num_gpu`, but in the example you gave, you are increasing the layers. To decrease the memory pressure you need to decrease the layers; have you tried `/set parameter num_gpu 20`?
@MHugonKaliop Your context window is changing between calls. You can add `"options":{"num_ctx":x}` to the calls to maintain a consistent window size, or if you can't modify the API calls, you can create a new model with a default context window by adding `PARAMETER num_ctx x` to the modelfile.
@MHugonKaliop commented on GitHub (Aug 6, 2024):
@rick-github
You were right; it explains the offloading when using /embed for the same model.
But I still have the issue when a call goes to the other model. The first one is removed from VRAM.
@rick-github commented on GitHub (Aug 6, 2024):
There's not enough context in your log snippet to see what's going on, can you post the full server log?
@abes200 commented on GitHub (Aug 6, 2024):
@rick-github
That is a good idea... I did notice it was adjusting the layers automatically as I changed the settings, lowering it as I increased num_ctx and vice versa. I think I just assumed it was working that out correctly.
After some testing, it seems it's miscalculating the layers by 1. I tried changing num_ctx to adjust the layers it was loading; every time I reduced the layer count by 1 from what was in the log file, it stopped reloading.
So while I'm still puzzled why it's not doing that for everyone, or why it suddenly started doing that recently, it does allow me to make a fix for myself! Thank you! Ollama is awesome, so I am happy again.
I'll leave this open for now until Ollama works automatically again (unless requested to close it), in case I discover it is something I did or you want me to try anything else.
@MHugonKaliop commented on GitHub (Aug 6, 2024):
Here we go with the logs, with OLLAMA_DEBUG=1.
Initialisation phase:
Then comes a first call (an embedding); Codestral loads:
I've noticed that it offloads 53/57 layers to the GPU.
Then comes a call to /api/generate (same model):
At this point, there will be another call to /api/generate, and then the model will unload.
@MHugonKaliop commented on GitHub (Aug 6, 2024):
I have another server with ollama v0.3.0, and codestral:latest only asks for 15 GB.
On this server, I have room left for llama3:latest
@rick-github commented on GitHub (Aug 6, 2024):
Your GPU has 21.8G free:
The model requires 18.4G for the weights, and because you have a large context window (32768), another 7G is required for KV space. 25G won't fit into the available space, so ollama loads as much as it can into VRAM, 21.5G. This leaves <1G free.
Now you do the embed call. The model requires 1G for weights, KV space, and the graph:
ollama can't fit both models:
So it unloads one model in order to load another. When an API call is made to the previous model, it unloads the current one and reloads the old one.
There are several ways to mitigate this.
2. Set the `num_gpu` parameter for the codestral model to some number less than the number of layers that ollama computes it can offload. This leaves room for loading the smaller embedding model, at the cost of slightly slower inference by codestral.
3. Set the `num_gpu` parameter for the embedding model to zero. This will cause the model to run only in system RAM using the CPU, at the cost of slower embeddings.
The trade-off between 2 and 3 is where you think most of the inference time is spent. If you are doing a lot of embeddings, then having codestral run a little bit slower may be acceptable. On the other hand, if embedding calls are infrequent, you want to let codestral have all the VRAM and let the embedding model make do with the CPU.
There are two ways to adjust the number of layers offloaded. You can either add `"options":{"num_gpu":50}` to the API calls, or you can set a default value in the model itself with a `PARAMETER num_gpu` line in the modelfile. The same holds for setting the `num_gpu` value for the embedding model.
In the case where the other server only takes 15G for codestral, I suspect that the context window is smaller than on the original server.
It can be argued that the model scheduler could be smarter, and the ollama team have mentioned this. But because the scheduler can't know the workload, it will sometimes be wrong, and manual overrides like the above are sometimes necessary.
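A sketch of what mitigations 2 and 3 might look like together over the HTTP API; the embedding model name, layer count, and context size are illustrative placeholders rather than values from this setup.

```python
import requests  # third-party: pip install requests

BASE = "http://localhost:11434"

# Mitigation 2: offload fewer layers than the computed maximum so some
# VRAM stays free for a second model.
gen = requests.post(f"{BASE}/api/generate", json={
    "model": "codestral",
    "prompt": "Hello",
    "stream": False,
    "options": {"num_ctx": 32768, "num_gpu": 50},
}, timeout=600)
gen.raise_for_status()

# Mitigation 3: keep the embedding model off the GPU entirely.
emb = requests.post(f"{BASE}/api/embed", json={
    "model": "nomic-embed-text",  # stand-in for the unnamed embedding model
    "input": "some text to embed",
    "options": {"num_gpu": 0},
}, timeout=600)
emb.raise_for_status()

print(gen.json()["response"][:80])
print(len(emb.json()["embeddings"][0]))
```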
@MHugonKaliop commented on GitHub (Aug 6, 2024):
Very interesting, thank you so much for the time you spent on this detailed explanation!
I wasn't aware of the size needed for the context window!
Makes perfect sense now
@trevorboydsmith commented on GitHub (Jan 2, 2025):
Just an FYI: in my case I was seeing the log file for my ollama v0.4.1 unload and reload a lot. I am doing CPU-only processing with llama3.2:3b. I saw here he is doing embed, and so am I. I also do /api/generate with a context size of `32*1000`. My /api/embed did not set the options context value, so I changed my call to /api/embed with the options set to the same context size of `32*1000` as the /api/generate, and now my log file shows that the model no longer unloads and reloads --> all fixed.
I did see this line in the server log:
So... this log file message looks like (I'm no expert on reading this log file) something about not having enough memory... hence the constant unload/reload in between /api/generate with num_ctx=32000 and /api/embed.
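A minimal sketch of the fix described in this comment, assuming the plain HTTP API with the requests library; the key point is that the options passed to /api/generate and /api/embed are identical.

```python
import requests  # third-party: pip install requests

BASE = "http://localhost:11434"
OPTS = {"num_ctx": 32 * 1000}  # keep this identical across every call

requests.post(f"{BASE}/api/generate", json={
    "model": "llama3.2:3b", "prompt": "Hello",
    "stream": False, "options": OPTS,
}, timeout=600)

# same num_ctx here, so the scheduler sees the same load parameters and
# does not unload/reload the model between the two calls
requests.post(f"{BASE}/api/embed", json={
    "model": "llama3.2:3b", "input": "Hello", "options": OPTS,
}, timeout=600)
```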
@rick-github commented on GitHub (Jan 3, 2025):
You are correct in your analysis that the missing `num_ctx` causes the unload/reload. This will be fixed if #8029 is merged.
@ghost commented on GitHub (Feb 7, 2025):
a400df48c0/server/sched.go (L147)
a400df48c0/server/sched.go (L603)
It seems that there is a design in the scheduling logic: if the parameters of two requests are inconsistent, it triggers a model reload. I am currently encountering frequent restart issues, and the scenario I can reproduce is: sending two requests with inconsistent parameters to the same model will definitely cause the model to reload. I don't know what the original intention of this design is; it's a bit difficult to understand.
@rick-github @abes200
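A small sketch of the behaviour described above, assuming a local server and the requests library: two consecutive calls to the same model that differ only in num_ctx should show an unload/reload in the server log, while identical options avoid it. The model tag and context sizes are placeholders.

```python
import requests  # third-party: pip install requests

URL = "http://localhost:11434/api/generate"

# Two requests that differ only in num_ctx: the second should trigger an
# unload/reload, visible in the server log. Using one value for both
# requests avoids the reload.
for num_ctx in (2048, 8192):
    requests.post(URL, json={
        "model": "llama3.2:3b",
        "prompt": "Hello",
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }, timeout=600)
```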