Mirror of https://github.com/ollama/ollama.git (synced 2026-05-06 16:11:34 -05:00)
Closed · opened 2026-04-12 10:48:15 -05:00 by GiteaMirror · 123 comments
Originally created by @jadhvank on GitHub (Jan 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1863
Originally assigned to: @jessegross on GitHub.
I updated Ollama from 0.1.16 to 0.1.18 and encountered the issue.
I am using Python to run LLM models with Ollama and LangChain on a Linux server (4 x A100 GPUs).
There are 5,000 prompts to send, and I collect the results from the LLM.
With Ollama 0.1.17, the Ollama server stopped after 1 or 2 days.
Now it hangs within 10 minutes.
This is the Ollama server message when it stops running.
It happens more often when Phi 2 runs than when Mixtral runs.
After the freeze, if I exit the server and run it again, the prompt is processed and the LLM answer is received successfully.
The environment

Linux: Ubuntu 22.04.3 LTS
python: 3.10.12
Ollama: 0.1.18
Langchain: 0.0.274
Mixtral: latest
Phi 2: latest
GPU: NVIDIA A100-SXM4-80GB x 4
Prompt size: ~10K
# of Prompts: 5K
I have read these issues: https://github.com/jmorganca/ollama/issues/1853, https://github.com/jmorganca/ollama/issues/1688,
but none of them works here.
Also, if there is any way to install a previous version of Ollama (0.1.16), let me know.
@Mahmuod1 commented on GitHub (Jan 9, 2024):
@jadhvank
For a previous version you can install the Docker image (ollama on Docker Hub).
@jmorganca commented on GitHub (Jan 9, 2024):
Hi @jadhvank sorry you hit this, looking into it
In the meantime, an easy way to install 0.1.17 is:
@iplayfast commented on GitHub (Jan 9, 2024):
I think this is related to https://github.com/jmorganca/ollama/issues/1691
@IAMBUDE commented on GitHub (Jan 9, 2024):
I also experience this issue with 2x 3090 GPUs. The server just stops generating.
@jadhvank commented on GitHub (Jan 10, 2024):
I updated Ollama to version 0.1.19 and the hang happened again within 5 minutes.
I removed 0.1.19 and installed 0.1.16.
The hang occurred after 6 hours (better!).
@EmanueleLenzi92 commented on GitHub (Jan 17, 2024):
I think I have the same problem. After a few runs, the ollama server crashes and stops generating text. I'm using Windows 11 (WSL Ubuntu) and LangChain. I have an RTX 4090 and I tried versions 0.1.16 through 0.1.19, but all of them have this issue in my case.
On a laptop with Windows 10 and an NVIDIA T500, however, I don't have this problem.
@hml-github commented on GitHub (Jan 18, 2024):
Me too, same problem: generation stops after a random amount of time.
@amirdeljouyi commented on GitHub (Jan 24, 2024):
Similarly, it halts after approximately 100 iterations.
@mchiang0610 commented on GitHub (Jan 27, 2024):
wanted to see if anyone is still running into this issue with ollama v0.1.22
@EmanueleLenzi92 commented on GitHub (Feb 2, 2024):
I confirm I still have this problem with 0.1.22.
@julienlesbegueriesperso commented on GitHub (Feb 2, 2024):
I confirm also (on a MacBook Pro 2.6 GHz Intel Core i7 and on a CPU-only server).
@Simaky commented on GitHub (Feb 2, 2024):
I could confirm that issue with 0.1.23 (on WSL)
I ran the script with 100 requests and saw in the logs that 6/10 requests were frozen and never received a response :(
@svilupp commented on GitHub (Feb 8, 2024):
+1
I run a community leaderboard for Julia code generation and I've run 10s of thousands of samples in the past (with failures, but not unreasonable).
Recently, I've updated and haven't been able to run anything anymore... Same machine/setup
Behavior:
Workload:
System:
Ollama header:
@wac81 commented on GitHub (Feb 13, 2024):
Could it have anything to do with GPU memory management?
My experience is that if you use a 12 GB GPU to load the llama 13B model, the output will basically get stuck once it exceeds 200 tokens.
@jmorganca commented on GitHub (Feb 20, 2024):
This should be fixed as of 0.1.24. Please let me know if that isn't the case, and we'll re-open this (and get it fixed once and for all 😊). Sorry about this!
@StrikerRUS commented on GitHub (Feb 20, 2024):
@jmorganca Unfortunately, it isn't fixed in 0.1.25.
OS: Ubuntu 22.04.2 LTS
GPU: NVIDIA RTX A6000 (Driver Version: 530.41.03, CUDA Version: 12.1)
Model: tested mixtral:8x7b-instruct-v0.1-q4_K_M, mixtral:8x7b-instruct-v0.1-q6_K, llama2:7b-chat-q4_0
Env: official Docker image
/api/generate and /api/chat hang completely while the version and tags endpoints keep working. Even docker compose restart doesn't help; only a complete down + up helps. I observed this behavior sometimes with 0.1.23, but 0.1.25 makes things even worse - it hangs approximately every hour.
@calebdel commented on GitHub (Feb 21, 2024):
@jmorganca, Likewise still seeing this issue after a small number of iterations on v0.1.25
@EmanueleLenzi92 commented on GitHub (Feb 22, 2024):
I confirm this problem with 0.1.25 and 0.1.26
@StrikerRUS commented on GitHub (Feb 23, 2024):
@jmorganca Can you please reopen this issue?
@BEpresent commented on GitHub (Feb 23, 2024):
Same here, issue still persists on fresh install (calling multiple times in a loop).
@ArjonBu commented on GitHub (Feb 24, 2024):
I am seeing this with 0.1.27 running in Docker on Linux. Docker has a limit of 8 GB of RAM but the container is using only 1 GB.
The container just hangs and shows nothing in logs. I am using open-webui as a frontend.
@julienlesbegueriesperso commented on GitHub (Feb 25, 2024):
I confirm also on 0.1.27, on macOS, Fedora with a GPU (RTX), and Ubuntu (without GPU). In a FastAPI + LangChain environment with 2 endpoints invoking 2 different ollama models, after I succeed in receiving responses from the first endpoint, I'm stuck when I try the 2nd endpoint. I have to restart the ollama service to see my response.
@ytlai1985 commented on GitHub (Feb 27, 2024):
I confirm that this problem occurs with versions 0.1.24 and 0.1.27. After adding a prompt about the output limitation, it seems to be resolved.
Does that mean no [EOS] token has been generated? Using the 'STOP' options will also resolve this problem, but sometimes it may not achieve the ideal result.
OS: Ubuntu 22.04.2 LTS
GPU: NVIDIA L4 (Driver Version: 535.154.05, CUDA Version: 12.2)
Model: Mixtral8x7b-instruct-v0.1-q5_K_M
For example:
- Limitation prompt
- Use options
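As an illustration of those two mitigations, here is a minimal sketch against the local /api/generate endpoint (the prompt wording, stop string, and num_predict value are assumptions, not the commenter's original examples):

```python
import requests

text = "...the document to summarize..."  # placeholder input

# Hypothetical example: bound the output in the prompt itself and cap generation via options.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-q5_K_M",
        "prompt": f"Summarize the following in at most 200 words:\n{text}",
        "stream": False,
        "options": {
            "num_predict": 512,   # hard cap on generated tokens
            "stop": ["</s>"],     # extra stop sequence (assumption)
        },
    },
    timeout=300,
)
print(resp.json()["response"])
```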
@dhiltgen commented on GitHub (Feb 28, 2024):
Has anyone come up with a minimal repro with curl or equivalent? I'll try to repro and get to the bottom of this.
@wizardsd commented on GitHub (Mar 1, 2024):
I confirm this problem with 0.1.27 on Windows 10 without WSL. Maybe related to format=json and stream=false?
@koleshjr commented on GitHub (Mar 4, 2024):
Could someone help us? This issue still persists:
I have updated to the latest release version, v0.1.28, and it still gets stuck after around 200 iterations on the Google Colab free-tier T4.
@eusthace811 commented on GitHub (Mar 5, 2024):
In my case, it becomes unresponsive right after the initial interaction.
1 GPU (NVIDIA A4500), 13b q4_K_M model
@giedriusrflt commented on GitHub (Mar 6, 2024):
Gets stuck also with:
@jonomillin commented on GitHub (Mar 6, 2024):
I'm getting the same thing with 0.1.26, 0.1.27, 0.1.28 on an M2 Max (64 GB RAM).
This happens both in the CLI (ollama run llava) and via the Python APIs (chat and generate). It will work fine on one or two images, then stall out. There is no crash; it just stops streaming new tokens and hangs.
Server logs are as follows for a sample run via Python:
@jithinmukundan commented on GitHub (Mar 7, 2024):
I am facing the same issue after running it on GPU. I had no issues previously when running it only on CPU. Using 0.1.28 and LlamaIndex. Eagerly waiting for a solution.
@urinieto commented on GitHub (Mar 9, 2024):
Same issue here :(
I'm on version 0.1.28. It seems to stop working after ~100 to ~3000 queries in my Linux setup.
@ckehagioglou commented on GitHub (Mar 10, 2024):
Regretfully, haven't managed to do so. Nevertheless, I went through the logs and noticed that when Ollama hangs, instead of the normal functions sequence:
launch_slot_with_data: slot processing task -> update_slots: slot progression, kv cache -> print timings: prompt, generation -> update slots: slot release
it goes through the following:
launch_slot_with_data: slot processing task -> update_slots: slot progression, kv cache -> update_slots: slot context shift
The last function executes infinitely until I stop the server and relaunch it. So, might be related to another issue I found (sorry I haven't pinpointed the number) related to infinite context shifting.
Hope the above provides a bit of assistance.
@harmanpreet93 commented on GitHub (Mar 14, 2024):
Facing a similar issue inside Docker on Ubuntu 18.04 with ollama version 0.1.28 on a Quadro RTX 5000.
@syrom commented on GitHub (Mar 14, 2024):
Same issue. Running ollama 0.1.28 on M1 Max.
My observation:
I work on a large number of text chunks as input for a RAG algorithm - and the task for the LLM (Mixtral in my case) is to extract keywords and concepts from the chunks. The document is rather large, so it makes quite a difference whether I set the character count for the text split to produce chunks of 1,000 or 2,000 characters. These chunks are served to Mixtral as the USER_PROMPT - and the SYSTEM_PROMPT by itself is also rather long.
Now the key observation: the failure seems to be functionally dependent on the length of the overall prompt.
If I set the text split to 2,000 characters, the overall prompt length is much longer - and the failure occurs much more quickly (5-10 generations) than if the text split is set to 1,000 characters (around 15-20 generations). Unfortunately, the algorithm ought to work its way through more than 400 to 800 chunks... which it doesn't.
Long story short: the occurrence of the bug seems to be a function of the number of tokens being served to the LLM through Ollama.
@niyogrv commented on GitHub (Mar 20, 2024):
I'm observing this issue in 0.1.28 on Ubuntu 22.04 with a 3060 (Driver: 535.161.07, CUDA: 12.2) and 16 GB RAM running TheBloke's Q6 Mistral Instruct v0.2 GGUF.
I'm encountering this only when I'm setting "format"="json". I am using the model for a classification task and only got through 5 queries before it hung up and I had to restart ollama. I was able to reproduce this consistently and it always failed at the 6th query
I reran it, this time without the "format"="json" param, and I am 4k+ requests in without a crash
UPDATE:
It crashed at around 5.7k requests. So, while the json format enforcement seems to accelerate the issue, it still seems to happen if you're constantly bombarding the model with requests. Hopefully, this gets fixed soon :(
@dhiltgen commented on GitHub (Mar 20, 2024):
This will likely be resolved with #3218 but I'll leave this open until we can verify the health check logic is sufficient to catch this hang scenario.
@syrom commented on GitHub (Mar 31, 2024):
My first feedback after the last Ollama update: the situation has improved a lot, but the problem has not gone away altogether.
I tried it out on several text sizes - and now it works on longer texts, but still eventually gets stuck on very long texts.
Before the last update, I would not get more than 30 generations in a row when feeding the algorithm text chunks of 1,000 characters. Now it works up to around 100 generations and slightly north of that.
But e.g. processing a text consisting of 200 or more chunks still gets the process stuck eventually.
@omani commented on GitHub (Apr 5, 2024):
I have to stop Docker, remove the container, and run it again to work around this issue. I hope someone fixes this soon.
@omani commented on GitHub (Apr 5, 2024):
here is an example of my local ollama in docker hallucinating:
this happens with almost all models after some time. sometimes within minutes of cancelling and restarting the model.
@omani commented on GitHub (Apr 5, 2024):
what is this? why is this happening with all my models? does it have anything to do with ollama?
@traddo commented on GitHub (Apr 7, 2024):
I ran into the same snag when I was working on summarizing text. Tweaking the prompt words sorted it out for me. I added a bit in the prompt to make sure the summary stays between 100 and 200 words.
@Mecil9 commented on GitHub (Apr 7, 2024):
the same issue!

My system:
Apple M1 Max, 64 GB
On the initial run, everything works fine. When questions are asked continuously, the system gets stuck, CPU usage keeps increasing, and GPU usage drops to 0 at the same time.
Once the CPU reaches 100%, ollama stops working. I have tried many methods to no avail!
@WithAnOrchid commented on GitHub (Apr 8, 2024):
This happened to me as well. After some research and testing, I found that setting the option num_keep to 0 fixed this issue. Possibly related to #2805, #2225.
Python code I used:
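The original snippet isn't reproduced in this mirror; below is a minimal sketch of passing num_keep through the /api/generate options (the model and prompt are placeholders, not the commenter's code):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2:7b-chat-q4_0",    # placeholder model
        "prompt": "Why is the sky blue?",  # placeholder prompt
        "stream": False,
        "options": {"num_keep": 0},        # the workaround reported above
    },
    timeout=120,
)
print(resp.json()["response"])
```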
@ckehagioglou commented on GitHub (Apr 9, 2024):
Looked all over the place to find out what num_keep does, but to no avail. All I found is that num_keep's default value is 0. Version 0.1.31 hangs even more often for me. Team ollama is doing great work, but this bug is destroying the experience.
Working on a Mac Studio M2 Max 32 GB, running many summarization tasks in sequence - if it helps.
@traddo commented on GitHub (Apr 9, 2024):
I added the num_keep parameter but the bug still exists. For now, I'm using a timeout to kill the ollama process as a workaround to complete the batch summary tasks.
I use a bash script to start the 'ollama serve &' process, checking every 2 seconds to see if the ollama process exists, and if not, start it.
When calling the API, I add a timeout limit of 1 minute. If a timeout exception occurs, I kill the ollama process, then wait for 2 seconds, remove the task causing the timeout, and start the loop again.
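A rough Python sketch of that watchdog pattern (the process name, timeout values, and model are assumptions, not the commenter's actual bash script):

```python
import subprocess
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ensure_server_running():
    # Start `ollama serve` if no ollama process exists (the commenter polls every 2 seconds).
    if subprocess.run(["pgrep", "-x", "ollama"], capture_output=True).returncode != 0:
        subprocess.Popen(["ollama", "serve"])
        time.sleep(2)  # give the server a moment to come up

def generate(prompt, model="mixtral"):
    ensure_server_running()
    try:
        r = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=60,  # 1-minute limit, as described above
        )
        return r.json()["response"]
    except requests.exceptions.Timeout:
        # Hung request: kill the server, wait, and let the caller skip this task.
        subprocess.run(["pkill", "-x", "ollama"])
        time.sleep(2)
        return None
```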
@abhinav-kashyap-asus commented on GitHub (Apr 9, 2024):
I also have this bug... :( Unfortunately sometimes even restarting the ollama server is not helping it
It just hangs
@mrroll commented on GitHub (Apr 9, 2024):
I have the same experience. Adding the parameter does not prevent Ollama from getting stuck.
@danest commented on GitHub (Apr 9, 2024):
This happens to me too, so I wrote a bash script that manages it and just restarts it every 10 minutes...
@jdonaldson commented on GitHub (Apr 9, 2024):
Hitting the stability issue here as well. Had to add a reset action in my Neovim config so I could poke it awake more easily:
https://github.com/jdonaldson/dotfiles/blob/main/.config/lvim/config.lua#L45
@dmitry-sablin-db commented on GitHub (Apr 12, 2024):
Got the same issue; it seems to be caused by using CodeLlama 34b, but that's not confirmed exactly - just an iterative check.
@jtoy commented on GitHub (Apr 12, 2024):
Just to give more notes: I use ollama on Mac and Linux. On Linux it seems more stable for me. On a Mac Studio M1 and MacBook Pro M1, I have to restart it every dozen or so requests because it just freezes. I want to run this on my Mac Studio as a server, but it's too unstable. I am going to just add a restart script every hour to see if that fixes it.
@javierrivarola commented on GitHub (Apr 15, 2024):
Same issue running on a MacBook 16 M3 Max with 36 GB of RAM. Ollama hangs after an hour or so of usage, and the logs don't seem to indicate anything wrong happened. Seems I'll need to use a cron job to restart it every hour.
@danomatika commented on GitHub (Apr 16, 2024):
We are seeing the same issue with Ubuntu 20.04 LTS and 2 x A100. So far I am taking a timeout-check-and-restart approach by running the following script (ollama-check) every 10 minutes with cron. This may not be the best solution, but we will try it for now.
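The ollama-check script itself wasn't mirrored here; a sketch of an equivalent periodic health check for cron (the systemd service name is an assumption from the standard Linux install):

```python
#!/usr/bin/env python3
# Run from cron every 10 minutes: restart Ollama if the API stops answering.
import subprocess

import requests

try:
    requests.get("http://localhost:11434/api/tags", timeout=10)
except requests.exceptions.RequestException:
    # Assumes Ollama runs as the standard systemd service on Linux.
    subprocess.run(["systemctl", "restart", "ollama"], check=False)
```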
@jossalgon commented on GitHub (Apr 16, 2024):
By modifying this with the latest version, I have not had any more problems. If anyone tries it too and it works for them I can do PR.
@dhiltgen commented on GitHub (Apr 17, 2024):
Please give 0.1.32 a try and let us know if you're still seeing unrecoverable hangs.
@airbj31 commented on GitHub (Apr 17, 2024):
I still have the same issue on both a Linux computer (Ubuntu 22.04 + GTX 4090) and a MacBook Pro (M3), but the tendency is reduced compared to the previous version (v0.1.30).
@calebdel commented on GitHub (Apr 17, 2024):
0.1.32 seems to have fixed the issue for me. 2000+ iterations so far without a hang. Previously 5-10 iterations would cause Ollama to hang.
@BruceMacD commented on GitHub (Apr 18, 2024):
Thanks to everyone for reporting and testing this. Marking this as resolved for now pending any more reports.
@kungfu-eric commented on GitHub (Apr 25, 2024):
Hangs after about 400 long-context requests on Mixtral, and the same with llama3.
While hung, the ollama server continues to output this, but no response is given to the client:
@entmike commented on GitHub (Apr 25, 2024):
Still having the problem here on version 0.1.32. I am running batches of image annotations with llava and it will just hang after a few dozen images or so.
RTX 4090
Ubuntu 22.04
Running via Docker container
Restarting the container kicks it back into submission, but I'm looking for a more reliable answer.
@kirill-vas commented on GitHub (Apr 26, 2024):
Also still experiencing hangs when calling the /api/chat endpoint. Running the HumanEval benchmark (164 samples), it usually fails after about 70-80 calls and requires an ollama serve restart to recover. It mostly happens with CodeLlama-70b rather than the smaller models (13b, 7b; only tested these). Running v0.1.32 on Ubuntu 22.04.2 with an NVIDIA RTX A6000, driver 530.30.02, CUDA 12.1, using a Docker container.
The specific part where it seems to loop indefinitely is the update_slots function, with the "msg":"slot context shift" line from the ollama serve logs (full logs below).
Code that calls the endpoint:
Full log of the run below:
@omani commented on GitHub (Apr 26, 2024):
I don't understand the hurry to close this issue without getting enough feedback first. Where did you learn this, @BruceMacD? Or is this normal procedure in your dev workflow?
@EmanueleLenzi92 commented on GitHub (Apr 27, 2024):
I still have this problem with the 0.1.32 version with an RTX 4090 and Windows 11 (on WSL Ubuntu).
After a few runs (fewer than 10), the Ollama server is stuck and I can't access "localhost:11434/" anymore unless I kill the process.
@frederick-wang commented on GitHub (Apr 27, 2024):
Got the same bug with A100 on Ubuntu 22.04. ollama version is 0.1.32.
@frederick-wang commented on GitHub (Apr 27, 2024):
Sorry bro @BruceMacD, I found that this issue has not been resolved. I encountered the same stuck issue yesterday (ollama 0.1.32, A100, Ubuntu 22.04) and had to restart to resolve it.
@ckehagioglou commented on GitHub (Apr 28, 2024):
Same bug here. Mac M2 Max Studio hangs after several questions being asked.
@BruceMacD commented on GitHub (Apr 28, 2024):
Thanks for the reports, re-opening this.
Couple of questions to help me reproduce:
@airbj31 commented on GitHub (Apr 28, 2024):
@EmanueleLenzi92 commented on GitHub (Apr 28, 2024):
@dhiltgen commented on GitHub (Apr 28, 2024):
The pre-release for 0.1.33 is available now, which should resolve these long context hang/loop problems.
@syrom commented on GitHub (Apr 28, 2024):
@dhiltgen Great news, thank you: will try asap after I have the update installed.
FYI, the situation has already improved considerably - but hangups are still there with 0.1.32.
I experienced a hangup after having Ollama / Mixtral churn through a large text file for > 12 h, extracting semantic information from it.
Setup: M1 Powerbook with 64 GB RAM and Ollama 0.1.32.
The text had 623 chunks with 1,000 characters each (plus another ca. 400 characters of prompt), and the hangup occurred after processing 517 of these chunks.
@WeirdCarrotMonster commented on GitHub (Apr 28, 2024):
I can still encounter this problem on 0.1.33: ollama gets stuck after 15 minutes of embeddings processing (using nomic-embed-text). Last log lines:
GPU: NVIDIA GeForce RTX 3060
Driver version: 550.54.14
CUDA Version: 12.4
@janis-inzpire commented on GitHub (May 3, 2024):
Just to add to this a bit - it looks like we are experiencing the same issue.
Running the llava model, it gets stuck every 15-20 minutes. Sometimes it gets stuck after just 4 requests.
We are using the API to call the endpoint, running version 0.1.33 through a Docker container.
@ukrolelo commented on GitHub (May 5, 2024):
+1, stuck on a question in a different language.
@syrom commented on GitHub (May 6, 2024):
A quick bit of feedback: from my perspective, the bug is solved as far as Ollama running on Apple Silicon is concerned. I was never able to process more than ca. 120 text chunks of 1,000 characters in one go on an M1 Pro Mac. Now, with the update to 0.1.33, the computer ran for 24 h nonstop, processing ca. 630 text chunks of a larger document to extract information from it... and did so to the very end.
Simply: thanks!
@maciejmajek commented on GitHub (May 9, 2024):
Still happens to me with llava models @ ollama v0.1.34
Interestingly, Ollama only freezes up when I use the /chat endpoint with both image and text data. It works fine when only text is sent.
I've noticed that the problem gets worse when I hit the /chat endpoint with multiple prompts at once using Ollama's queuing system. It tends to hang after about 30 seconds...
Setup:
2x RTX 4090
13900k
logs:
Last successful chat call:
[GIN] 2024/05/09 - 18:36:57 | 200 | 8.140684188s | 10.244.163.252 | POST "/api/chat"
time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:294 msg="context for request finished"
time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:232 msg="runner with non-zero duration has gone idle, adding timer" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a duration=5m0s
time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:248 msg="after processing request finished event" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a refCount=0
time=2024-05-09T18:36:59.457+02:00 level=DEBUG source=sched.go:435 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":850,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":851,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":852,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
time=2024-05-09T18:36:59.594+02:00 level=DEBUG source=prompt.go:172 msg="prompt now fits in context window" required=1988 window=2048
time=2024-05-09T18:36:59.595+02:00 level=DEBUG source=routes.go:1241 msg="chat handler" prompt="<|im_start|>system\n<|im_end|>\n<|im_start|>user\n- blah blah <|im_start|>system\n<|im_end|>\n<|im_start|>user\n[img-0] [img-1] input: two consecutive images blah blah <|im_end|>\n<|im_start|>assistant\n" images=2
time=2024-05-09T18:36:59.595+02:00 level=DEBUG source=server.go:591 msg="setting token limit to 10x num_ctx" num_ctx=2048 num_predict=20480
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":853,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49272,"status":200,"tid":"140038952120320","timestamp":1715272619}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":854,"tid":"140043376336896","timestamp":1715272619}
{"function":"update_slots","level":"INFO","line":1837,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":854,"tid":"140043376336896","timestamp":1715272619}
@mironnn commented on GitHub (May 16, 2024):
The same; still have issues.
ollama version 0.1.38
RTX A6000
llama3:70b
Hangs within <10 requests.
@quaintdev commented on GitHub (May 23, 2024):
Has happened with multiple models for me. My prompt is usually just a line. I have seen that it happens if I keep it idle for some time; when I come back, the responses are stuck. I am running on CPU. The first prompt after starting the ollama serve command always gets a quick response. I am on 0.1.37.
Edit: Happens on 0.1.38 too. I always see something like the below when this happens. With easyllama I don't see such an issue.
@sammcj commented on GitHub (May 26, 2024):
Hi all, give this fix a go: https://github.com/ollama/ollama/issues/4604#issuecomment-2130436000
@quaintdev commented on GitHub (May 26, 2024):
I'm not using Docker so I don't think this fix is applicable to me.
@jak4 commented on GitHub (May 27, 2024):
I'm experiencing a similar issue. I'm running on a virtualized VM with a Tesla P40. After booting the VM everything works, but after a while, when the server idles, it stops working. Neither requests from a frontend nor from the CLI (e.g. ollama run llama3) work. With the CLI it just never starts up. The log files don't show anything suspicious. "service ollama restart" does nothing. The only thing that is maybe not aligned is that I have CUDA 12.2 installed but the runner is using v11.
Edit: Version 0.1.38.
Edit: Also happens on version 0.1.39. What is maybe interesting is that this happens regardless of whether queries are run against the LLM or not. After booting the VM and not running any query for an unspecified amount of time (but less than 2 h), ollama becomes/is unresponsive. It seems the model gets loaded but doesn't finish. After loading the model up to a certain point - with Llama 3, around 4800 MiB of GPU RAM - the loading slows to a crawl and the GPU RAM usage increases by about 2 MiB every few seconds (every two seconds?). At some point it increases by 6 MiB every few seconds (at 4922 MiB), and then stops completely (at 4934 MiB). After a while the process stops and the GPU RAM is completely empty again.
When comparing a working ollama instance to a non-responsive instance, the load speed for the model is way higher when everything works out. The model I used for this testing uses 4934 MiB when fully loaded, which tracks with the number above.
@blubbsy commented on GitHub (May 28, 2024):
I'm seeing the same problem on Windows. I'm using llava:v1.6 and pass the images through bind(...) as base64 and then invoke the prompt. It works fine for a few prompts and then stops.
I'm currently checking whether maybe something else could be wrong, but as my experience fits what I read here, I want to mention it.
@jak4 commented on GitHub (May 31, 2024):
I'm seeing the same issues with vLLM, which indicates a problem with some underlying libraries, e.g. torch or maybe something CUDA-related. What is fascinating is that this apparently has nothing to do with the time between requests or even "going into sleep mode", since I managed to perform a query to vLLM which was working perfectly at around 15 tokens per second and then slowed to a crawl at 0.1 tokens per second. So this happens even while generation is running.
@jak4 commented on GitHub (Jun 1, 2024):
I have resolved my issue. It had nothing to do with ollama, vLLM, or any other part of the software stack. It was a LICENSING issue. I simply forgot to acquire a license for the vGPU the VM was using, so after a while the NVIDIA driver degraded the performance of the vGPU until it became basically unusable.
@mchiang0610 commented on GitHub (Jun 1, 2024):
@jak4 thank you for letting us know about this. May I ask what the VM provider was, so we know what to look out for in the future?
@jak4 commented on GitHub (Jun 3, 2024):
@mchiang0610 I'm unsure what you mean by VM provider, but I'm running a homelab with Proxmox as the VM host and a Tesla P40. The guest is a Debian 12 instance.
@azeezabdikarim commented on GitHub (Jun 6, 2024):
I am having the same issue running:
ollama 0.1.41
M2 Max with 64 GB RAM
I was initially using the 'llava' model, which would hang after ~10 image and prompt pairs. Now I have switched to 'llava-llama3' and am able to process ~20 requests before it hangs.
@Luzifer commented on GitHub (Jun 6, 2024):
With llama3:8b and dolphin-mistral:7b, v0.1.41 produces complete garbage after some prompts. Downgrading to v0.1.38 solved this for me: both models behave properly as before. (Both versions built through the Arch Linux build process.)
@dhiltgen commented on GitHub (Jun 6, 2024):
I haven't been able to reproduce this yet
On an M3 mac, the following loops at least 80+ times without problem:
A CUDA Windows system also loops cleanly with this.
Perhaps there are some non-default settings being passed via API clients that are causing the hang? Can anyone share a minimal curl loop that reproduces it?
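For anyone wanting to try this, a minimal loop sketch in Python rather than curl (default local endpoint; the model name and padded prompt are placeholders meant to mimic the long-context workloads reported above):

```python
import requests

# A deliberately long prompt to exercise context shifting.
prompt = "Summarize the following text:\n" + ("lorem ipsum dolor sit amet " * 400)

for i in range(1000):
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3:8b", "prompt": prompt, "stream": False},
        timeout=600,
    )
    print(i, len(r.json().get("response", "")))
```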
@Luzifer commented on GitHub (Jun 7, 2024):
It looks like it's way easier to break with a model derived from llama3:8b with a longer system prompt (which I'm not able to share) than with the plain llama3:8b, but eventually a chat (OpenWebUI) with llama3:8b also broke down: chat-Unseen Backyard Secret Revealed.txt
Just guessing: as the model with a longer system prompt breaks earlier, I'd say the bigger the amount of text, the earlier it breaks.
After that linked chat, even ollama run invocations are producing garbage:
Run Output
@Earnest-Williams commented on GitHub (Jun 27, 2024):
This happens frequently with 0.1.45 and dolphin-mixtral, the 26 GB version. Ollama is running from the command prompt on my XTX. It does not seem to happen with smaller models.
@hybra commented on GitHub (Jun 30, 2024):
I was having the same issue on a Mac. What I just found out is that if I run (on 0.1.48) ollama serve & from the terminal, then I run into the issue of the server crashing after 5-6 requests to /api/generate, and .ollama/logs/server.log is not created/populated.
Whereas if I run open /Applications/Ollama.app &, the log is created and the server works flawlessly (I left it all night and we're now at 600 logged requests), and the Ollama icon appears in the macOS menu bar (where you can also quit it).
So there is definitely some difference between launching the bare ollama server and starting the app, although I can't say with precision what causes the issue (and blocks the logging), but at least it sounds like there's a workaround. I find this behavior to be consistent.
Could someone on Linux and Windows check this out too?
@jtoy commented on GitHub (Jun 30, 2024):
Honestly, from all my personal experience and from talking to lots of other developers: Ollama is really good for quick prototyping and testing models, but due to ongoing issues like this it is not really meant for production. For production, vLLM is much more stable and seems to be used by lots of companies.
@hybra commented on GitHub (Jun 30, 2024):
vLLM seems to be Linux-only. But we're off-topic now.
@emyasnikov commented on GitHub (Jul 11, 2024):
It seems to me that when using LLaVA, Ollama freezes completely after a couple of hundred requests. I can't find any errors or other issues that could explain it.
@itinance commented on GitHub (Aug 15, 2024):
On my Mac M1, it has been running for hours serving hundreds of requests. On my Hetzner server with an NVIDIA RTX 4000, it gets stuck after some requests.
@itinance commented on GitHub (Aug 15, 2024):
After 10 minutes or so, it starts to hang on Linux running an RTX 4000. On the Mac M1, it has been running properly for 24 hours.
https://github.com/ollama/ollama/issues/6380
@hhhhzl commented on GitHub (Sep 26, 2024):
Same issue here: 4 x V100 32 GB GPUs on Linux, Docker ollama, llama2 70B. It hangs for two hours after some iterations, but all 4 GPUs are active (at about 30%) while it hangs. It generates smoothly for 100 iterations, then hangs, then generates some again, then hangs...
@omani commented on GitHub (Sep 27, 2024):
It amazes me that this issue has been persisting for over 8 months now and nobody knows how to fix it.
@blubbsy commented on GitHub (Sep 27, 2024):
Yeah, I think the problem is that nobody has a clue why it is happening, or nobody has been able to debug it properly.
Do the ollama alternatives have the same problem, or do vLLM or LocalAI work without problems for the same tasks? Because this would really push me to move to another system...
@omani commented on GitHub (Sep 28, 2024):
I think the devs aren't aware of the fact that they are building an app that does not run properly. The commits show that this repo is active, but that doesn't do anything if the app is not working for many people.
And if they are aware, then it looks like wrong prioritization of tasks. If I were a project manager or team lead or whatever, I would stop everything else and make my people fix this bug with top priority, because obviously the happy path is broken.
@blubbsy commented on GitHub (Sep 28, 2024):
Well, it does run. The problem seems to only happen with the very large models. If you use a llama model <7B you don't have these problems. At least for me it only happens with the large models (for my applications I do not need the big ones).
@maciejmajek commented on GitHub (Sep 29, 2024):
I wonder if that happens with llama.cpp too. Does anyone have any insights on that?
@itinance commented on GitHub (Sep 29, 2024):
We use llama.cpp in the meantime, and there it works perfectly.
@jessegross commented on GitHub (Oct 9, 2024):
We may finally have a solution to this. For those that are experiencing the problem and are able to build from source, there is a new runner module that is currently being tested. Instructions for building it are here:
https://github.com/ollama/ollama/blob/main/docs/development.md#transition-to-go-runner
@jason-ni commented on GitHub (Oct 9, 2024):
Yes, the llama.cpp server works well without a similar issue. However, I've been looking for tool-calling API support recently, and it seems the llama.cpp server is lagging behind, as I saw related issues are still open.
Glad to see the ollama team is addressing this issue. Well done! Thank you!
@blubbsy commented on GitHub (Oct 9, 2024):
Is there a timeline for when an official build will be provided?
@jessegross commented on GitHub (Oct 9, 2024):
We are phasing it in (opt-in, opt-out, etc.) to try to catch any surprises. The general goal is to have it broadly available by the end of the month if nothing major comes up. However, the more people that are able to test it, the faster we can build confidence.
@WeirdCarrotMonster commented on GitHub (Oct 11, 2024):
In my case it seems to have helped — I was able to leave ollama running overnight, processing batches of embeddings via the /api/embed endpoint. It ran a little under 8 hours with no freezes or error logs.
@Shahin-rmz commented on GitHub (Oct 16, 2024):
Thanks for taking the problem into consideration. I am mostly working with Google Colab, and I cannot run ollama for long runs.
Happy to test the new feature if I can.
@dhiltgen commented on GitHub (Oct 23, 2024):
Please give the latest 0.4.0 RC release a try and let us know how it goes.
https://github.com/ollama/ollama/releases
@willypaz243 commented on GitHub (Nov 17, 2024):
I have a similar problem. I use an extension in VS Code called "Continue - Codestral, Claude, and more" which uses ollama as an LLM provider. I have configured the tabAutocomplete function, which makes queries to ollama to autocomplete code, but since version 0.4.0 ollama gets stuck after some queries. Running ollama run codegemma:code, I saw that for some queries it keeps generating tokens without stopping, and I believe that is the cause of the hangup; it also prevents the model from stopping with ollama stop codegemma:code - ollama ps shows "stopping ..." but it never ends.
@dhiltgen commented on GitHub (Nov 18, 2024):
@willypaz243 you might be experiencing the same thing as #7645 - with OLLAMA_DEBUG=1 set, when it gets stuck we see periodic "context limit hit - shifting" in the logs and the ollama_llama_server process saturates 1 CPU core.
@naffiq commented on GitHub (Nov 19, 2024):
Unfortunately, I am also experiencing this issue with version 0.4.2.
MacBook Pro with Apple M2 Pro and 16 GB of unified memory
@jessegross commented on GitHub (Nov 21, 2024):
I think the original issue is fixed, but it sounds like there is a new issue with somewhat similar symptoms. I'm going to close this issue so we can track it in a single place in #7645. For those that are running into this, we have made further improvements in this area, so it would be helpful if you can test with 0.4.3-rc0 (or later) and report the results in the other bug.
@KalyanKumarAdepu commented on GitHub (May 12, 2025):
Hi, I am using an EC2 instance of type g5.4xlarge (with an A10 GPU). I installed Ollama and tried using the llama3.2:3b model. I have a DataFrame with 600 rows, and for each record, I need to call the LLM model 26 times sequentially. I tried running this in a loop, but after completing certain milestones like 10, 50, or 100 records, the LLM API stops responding — it literally gets stuck. How can I resolve this issue?
@hossam1522 commented on GitHub (May 17, 2025):
Facing the same error with mistral:7b.
@voycey commented on GitHub (Jul 5, 2025):
This is still happening in July 2025 with unsloth/gemma3 models.
@bennyschmidt commented on GitHub (Jul 8, 2025):
Can confirm it still happens with a high volume of requests - it will eventually hang - but it seems like a normal concurrency issue that the end developer should deal with (not Ollama).
What I am doing is managing my own queue using a library (bee-queue) with Redis, enqueuing every request and running through the queue at a static interval - ensuring each Ollama request completes before the next one is sent in. No more issues with hanging.
There is apparently a way to accomplish it within Ollama (via OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE) and even control these per model via the Modelfile - but I haven't been able to get the built-in queue to work at scale.
Edit: OLLAMA_NUM_PARALLEL works as intended, but there are 2 layers to scaling it for a high volume of requests:
1. Scaling LLM requests to the max CPU load (likely a small number on your personal machine). This is the flow of requests from your app to Ollama.
2. Scaling your app's network requests to handle all incoming traffic (however concurrent it may be). This is the flow of requests from end users, through your app, to Ollama.
That's why even though Ollama has an internal queue, your app that uses Ollama likely still needs one.
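A minimal sketch of that application-side queue idea, using Python's standard library instead of bee-queue/Redis (the model name, timeout, and callback shape are illustrative):

```python
import queue
import threading

import requests

jobs = queue.Queue()

def worker():
    # One request in flight at a time: the next job is only taken after the
    # previous Ollama call has completed, failed, or timed out.
    while True:
        prompt, on_done = jobs.get()
        try:
            r = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": "llama3:8b", "prompt": prompt, "stream": False},
                timeout=300,
            )
            on_done(r.json().get("response"))
        except requests.exceptions.RequestException:
            on_done(None)  # surface the timeout/failure to the caller
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# Request handlers just enqueue work and return immediately:
jobs.put(("Why is the sky blue?", print))
jobs.join()
```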
@voycey commented on GitHub (Jul 8, 2025):
We have tried with our own queue - it's the "ensuring each Ollama request completes" part that trips us up, because when it hangs the task never completes and there is no natural TTL on the request.
@bennyschmidt commented on GitHub (Jul 8, 2025):
You can still handle timeouts in your app though. Just an opinion, but I think developers should manage the abstraction of handling a high volume of requests, and not Ollama. It's just a wrapper library for LLMs. If you have an API that handles a high volume of requests – yes, even those that can timeout with no response – you should just manage that in your application.
@voycey commented on GitHub (Jul 8, 2025):
But the point is: how do you manage a timeout for which there is no natural TTL? Sure, I could say "no request should go on longer than 20 minutes", but that's horribly inefficient, and some requests can easily go on for 10-15 minutes. No API would hold a connection that long, yet slower responses on locally hosted LLMs can expectedly take that long.
vLLM doesn't have this issue. This sounds like a workaround that is required because Ollama doesn't handle it correctly; otherwise I would need to do the same in vLLM.
@bennyschmidt commented on GitHub (Jul 8, 2025):
Precisely the point. With this approach, all your API endpoint does is enqueue LLM requests. Your API endpoint does not hang around for the lifecycle of the LLM request.
For such a task, you need a queue.
Beyond that point (and closer to the issue), the problem isn't really long-running APIs anyway but the fact that Ollama can't handle many thousands or millions of concurrent requests. You have to enqueue those in your application.