Closed · 46 comments
Originally created by @loveyume520 on GitHub (Jul 13, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5668
What is the issue?
After running for a while, the model still returns gibberish:
Then I try posting again and it responds:
Here's the ollama serve log:
OS: Windows
GPU: AMD
CPU: AMD
Ollama version: 0.2.3
@somnifex commented on GitHub (Jul 13, 2024):
Same problem
@lalahaohaizi commented on GitHub (Jul 13, 2024):
Same problem here; after handling a few messages it starts continuously outputting gibberish.
@ototsu commented on GitHub (Jul 14, 2024):
The same GLM4 issue occurs in version 0.2.5, but the model runs normally in llama.cpp.
@arkerwu commented on GitHub (Jul 15, 2024):
Same issue
@HuChundong commented on GitHub (Jul 15, 2024):
+1
@DanielusG commented on GitHub (Jul 17, 2024):
The q8_0 version doesn't have this problem, so I assume it's a quantization issue.
@ototsu commented on GitHub (Jul 17, 2024):
@DanielusG I've tried running the q8 and q4 quantizations of GLM4 on Ollama, and both resulted in this issue, but it didn't occur on llama.cpp. It seems the problem isn't related to quantization.
@DanielusG commented on GitHub (Jul 17, 2024):
@ototsu That's really strange; with Ollama 0.2.5 the q8_0 works well on my PC, and I've used it extensively.
@DanielusG commented on GitHub (Jul 17, 2024):
With the latest version of llama-server, when I load the model I get "The chat template that comes with this model is not yet supported, falling back to chatml." This could be the cause.
@Speedway1 commented on GitHub (Jul 20, 2024):
The GGGG issue is caused by a fault in copying between more than one AMD GPU. Some quantised versions run, probably because the model fits into a single GPU's memory. You can fiddle with the context window and get a smaller model to run in a single GPU's VRAM, but if you extend the context window so that the model needs more than one GPU, it fails.
There was recently a fix that implemented the essential llama.cpp flag for AMD builds (GGML_CUDA_NO_PEER_COPY), but it seems there are other AMD issues with memory copying between GPUs as well.
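For reference, GGML_CUDA_NO_PEER_COPY is a CMake option in llama.cpp; a minimal sketch of enabling it in a standalone llama.cpp build of that era (the backend option names have shifted between releases, so treat this as illustrative):

    cmake -B build -DGGML_HIPBLAS=ON -DGGML_CUDA_NO_PEER_COPY=ON
    cmake --build build --config Release

How the flag reaches an Ollama binary depends on Ollama's own vendored build scripts.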
@somnifex commented on GitHub (Jul 21, 2024):
This may not be solely an AMD problem. I've reproduced this issue using two NVIDIA GPUs (a 3090 and a Titan X). However, I agree that it looks like a cross-GPU problem; the behavior is very similar.
@wxfvf commented on GitHub (Jul 23, 2024):
I encountered the same issue on CodeGeeX4 with dual 4090 GPUs, using Ollama version 0.2.5. It runs normally at first after loading, but problems occur after a while.
@leizhu1989 commented on GitHub (Jul 24, 2024):
I pulled the official image and ran it. At first, I could use the Python asynchronous client interface to request inference normally. After a period of time, when I tried to request again, I would see "GGGGGGGGG". I then ran "ollama run glm4:9b-chat-q8_0" in the image and was able to request and get replies normally again.
@wszgrcy commented on GitHub (Aug 1, 2024):
Same issue: some content is returned normally, some content is GGGG.
@AeneasZhu commented on GitHub (Aug 13, 2024):
Any improvements in the future? The GGGG problem seems to be even more serious in 0.3.5. @rick-github
@rick-github commented on GitHub (Aug 13, 2024):
Recent server logs would help in debugging.
@pdevine commented on GitHub (Sep 12, 2024):
I'm going to go ahead and close this out. Make sure you have the most recent version of ollama and the model. We can reopen if people are still having issues, but I couldn't repro at all.
@MDev-eng commented on GitHub (Dec 8, 2024):
The problem of Ollama outputting a string of "G"s is still present with the latest Ollama version, 0.5.1.
My setup: a Proxmox hypervisor environment; Ollama 0.5.1 (latest as of today) running as a system service in a VM; Open WebUI 0.4.8 (latest as of today) running in a container in another VM; and two RTX 3060 12 GB GPUs (passed through by Proxmox so the Ollama VM can access them).
The model used is llama3.2-vision:latest.
The use case: start a new chat in Open WebUI, select llama3.2-vision:latest, leave all Open WebUI settings at their defaults (including the context length, which is 2048 by default), then load an image in the chat and ask "Describe the image".
SETUP 1: Dual GPUs, both used
What happens (as seen with nvtop and nvidia-smi) is that, as soon as the first question is entered, Ollama detects that there are two 12 GB GPUs and spontaneously loads approximately 8.5 GB of the model on the first GPU and 5 GB on the second. What happens next varies slightly.
Sometimes (very infrequently), the first question is answered properly, but then, if a second question is placed, the answer is "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG".
Other times, already the first answer is "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG".
In both scenarios, anyway, once Ollama starts answering "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG", ALL subsequent questions are answered with "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG". It never goes back to normal, and you never get a normal answer again.
Even if a new chat is started in Open WebUI, all answers (even the first one) are always "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG". That is, it is not an Open WebUI problem that can be reset by acting on Open WebUI.
This suggests that the Ollama server has entered some corrupt internal state.
In some cases, the "ollama" process even goes to 100% CPU and never comes down.
The only way to restore normal behavior (both for the GGGGG problem and for the Ollama process going crazy) is to stop and restart the Ollama service.
SETUP 2: Dual GPUs, but Ollama restricted to using only one
If instead I force Ollama to use only one GPU, by specifying "CUDA_VISIBLE_DEVICES=0" in /etc/systemd/system/ollama.service before starting the Ollama service, then Ollama loads the entire model (about 11.3 GB) into the 12 GB VRAM of the first GPU, and the "GGGGGGGG...." problem no longer occurs. But I lose the ability to use both GPUs to load larger models or to expand the context size.
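For anyone reproducing this workaround, a minimal sketch of the systemd change described above (an Environment line in the unit file, or better, a drop-in override such as a hypothetical /etc/systemd/system/ollama.service.d/override.conf):

    [Service]
    Environment="CUDA_VISIBLE_DEVICES=0"

followed by systemctl daemon-reload and systemctl restart ollama.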
I think there is a problem with how Ollama allocates LLMs on multiple GPUs.
This is an important use case, especially for educational and non-corporate use, because not every LLM can fit on a single affordable GPU, but in some cases it can fit across a combination of two or three cheap GPUs, which significantly expands the possibilities for self-hosted experimentation with LLMs.
@MDev-eng commented on GitHub (Dec 8, 2024):
Another point is that with another LLM, qwen2.5-coder:32b, which definitely does benefit from the multiple GPUs (Ollama loads 9.9 GB on the first GPU and 10.1 GB on the second), the GGGGGGG... issue does not occur.
The problem only appears when the model is llama3.2-vision:latest, although this model is smaller than qwen2.5-coder:32b.
I think that Ollama should be able to run whatever LLM is chosen, correctly and automatically, on whatever underlying set of available GPUs, provided of course that their VRAM is sufficient. No "GGGGG..."s should ever be answered, and the Ollama server process should never go crazy at 100% CPU.
Even if the above occurs with a certain LLM and not with another, the LLM should not be blamed for these bugs. Ollama should behave properly, or issue an explicit error diagnostic, but not go randomly crazy.
@rick-github commented on GitHub (Dec 8, 2024):
Recent server logs would help in debugging.
As Patrick pointed out, this issue is difficult to replicate. There has been speculation up-thread about causes, but without logs and detailed information about the environment and workload, there will be no progress on this issue.
@leizhu1989 commented on GitHub (Dec 9, 2024):
I used Nvidia T4s with driver version 450 and encountered the GGGGG issue after asking the large model several questions. With driver version 470, however, I haven't encountered any issues across hundreds of runs.
@MDev-eng commented on GitHub (Dec 10, 2024):
Hello, in my setup it's the contrary: the problem is reproducible 99% of the time. The times it works as intended are the exception rather than the rule.
Ollama log attached.
The usage pattern is:
start with both the ollama server and open-webui down, and both GPUs' VRAM empty
start the ollama server (installed as a service in a dedicated VM); startup is complete at 05:56:41
start open-webui (a Docker container in a separate VM)
open the open-webui page
authenticate
load an image in open-webui
place the question "How many question marks are there in the image?"
ollama processes the request and spontaneously decides to load the model split across the two GPUs: 8.4 GB on the first and 3.75 GB on the second
a few seconds elapse while the GPUs are processing
then, on the Open WebUI side, the answer given is "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG" (the string does not appear in a single shot; rather, the "G"s appear gradually, like normal token generation)
the string also appears in the ollama log, but I think not as output from the ollama server; rather, it appears in the body of a POST request issued by open-webUI in which the chat history (including the GGGGGG... string) is re-included as context
The last ollama log message during processing seems normal; just before the "GGG.."s begin to appear, at 06:01:55, it reads:
level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=22 used=0 remaining=22
and then at 06:02:01 the "GGG..."s start to appear, and the ollama log message while that happens is:
level=DEBUG source=server.go:836 msg="prediction aborted, token repeat limit reached"
NOTE: In the rare cases in which the "GGGG..." output does not happen, the above line does NOT appear; that is basically the only difference in the ollama log between a successful and an unsuccessful test execution.
So I think the condition to investigate is "prediction aborted, token repeat limit reached", or whatever can result in that message.
Note that after that message there are about 30 more log lines (full detail in the attached log), but they are triggered by further requests made by open-webui, perhaps attempting to resume the chat by automatically repeating a GET /api/tags and a POST /api/chat to give conversation context to ollama.
Anyway, if the "GGG.." output happened, then at this point the ollama server process has almost always gone to 100% CPU and stays there forever unless explicitly killed. (In very rare cases the "GGG" output happens but the ollama server does not go to 100% CPU.) Conversely, when the "GGG" output does not happen, the ollama server always drops back to 0% CPU, waiting for new requests as expected.
4-chunk.txt
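As background on the "token repeat limit reached" message: it comes from a server-side guard that watches for degenerate runs of identical tokens. A minimal, hypothetical Go sketch of such a detector (illustrative only, not Ollama's actual code):

    package main

    import "fmt"

    // allSameTail reports whether the last n generated tokens are identical,
    // the kind of degenerate run ("GGGG...") a server-side guard can use
    // as a signal to abort a runaway prediction.
    func allSameTail(tokens []int, n int) bool {
        if n <= 0 || len(tokens) < n {
            return false
        }
        last := tokens[len(tokens)-1]
        for _, t := range tokens[len(tokens)-n:] {
            if t != last {
                return false
            }
        }
        return true
    }

    func main() {
        fmt.Println(allSameTail([]int{5, 9, 2, 7, 3}, 4))     // false: varied output
        fmt.Println(allSameTail([]int{5, 40, 40, 40, 40}, 4)) // true: stuck on one token
    }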
@rick-github commented on GitHub (Dec 10, 2024):
What's the output of nvidia-smi after the model is loaded?
@rick-github commented on GitHub (Dec 10, 2024):
Does the model respond properly if you ask it a question (Why is the sky blue?) before you upload an image?
@MDev-eng commented twice on GitHub (Dec 10, 2024); those comment bodies (likely screenshots) were not captured in the mirror.
@MDev-eng commented on GitHub (Dec 10, 2024):
Normally this model answers correctly when used with text questions.
Problems occur when images are uploaded and involved in the chat.
In the example, a first text-only question is answered properly. Then an image is loaded and, as occasionally happens, an answer is given (albeit an incorrect one). Then another image is loaded, and the system enters the corrupt state. From then on, all questions (including text questions) either go unanswered (timeout) or are answered the same way: "GGGGGGGGGGGGGGGGGGG...."
And usually, once this state is reached, the "ollama" process stays at 100% CPU until killed.
@rick-github commented on GitHub (Dec 10, 2024):
Logs from this session?
@MDev-eng commented on GitHub (Dec 11, 2024):
Here it is.
As usual, there is a "prediction aborted, token repeat limit reached" message when things go crazy.
bigsession-log.txt
@rick-github commented on GitHub (Dec 11, 2024):
Does the behaviour change if you do the same actions through the CLI? e.g.:
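The concrete example was not captured in the mirror; an illustrative equivalent, assuming an interactive session and a hypothetical image path (the CLI passes a file path in the prompt to vision models):

    ollama run llama3.2-vision
    >>> Why is the sky blue?
    >>> How many question marks are there in the image? /tmp/questions.png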
@MDev-eng commented on GitHub (Dec 22, 2024):
The behavior does not change. Either when answering the question about the first picture or when answering the one about the second, it outputs GGGG's; the ollama process goes to 100% CPU, stays there forever, and must be killed manually.
So the problem is not caused by open-webui.
Below is the console-based chat session. Attached is the log of the session; in the log there is the usual "prediction aborted, token repeat limit reached" message shortly before the GGGG's begin appearing.
console-based-session-ollama-log.txt
@MDev-eng commented on GitHub (Dec 22, 2024):
The above log is from Ollama 0.5.1.
I upgraded to the latest Ollama, 0.5.4, and the behavior is unchanged.
So, based on what we are seeing so far, it seems the problem is:
1- inherent to ollama, not due to open-webui
2- occurring even with the latest ollama version (0.5.4 at the time of writing), with the exact same message in the ollama log shortly before the GGG's begin appearing
3- reproducibly occurring when all of the following hold: the model is llama3.2-vision, the question is about a picture, and the hardware has two GPUs across which ollama spreads the LLM to take advantage of the combined memory
@rick-github commented on GitHub (Dec 22, 2024):
Would it be possible for you to try older versions of ollama to see if this is the result of a version change? 0.4.0 is the oldest that supports llama3.2-vision. There's another open issue where the user also has a multi-GPU setup and sees token-generation issues starting with 0.5.0.
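For anyone wanting to pin a version the same way, the Linux install script accepts a version override (per the Ollama FAQ; treat the exact version string as an example):

    curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.4.0 sh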
@MDev-eng commented on GitHub (Dec 26, 2024):
Installed Ollama 0.4.0, repeated the test, same behavior (it outputs GGGG's as soon as I ask a question about the image).
As usual, the log contains the message
msg="prediction aborted, token repeat limit reached"
just before it starts outputting gibberish G's.
@rick-github commented on GitHub (Dec 26, 2024):
Thanks. Just to clarify, "prediction aborted" is a symptom, not an indication of the cause. What's happening is that the runner is going off the rails and generating a bunch of identical tokens (GGG in this case). The server detects this sequence, determines that the runner has got into a bad state, and stops listening to it. It doesn't kill the runner, as the act of no longer listening is supposed to reset the runner. That doesn't appear to be working here, perhaps in part due to the switch to Go runners in 0.4.0.
What we have to determine is why the runner gets into this state. Previous occurrences of this were related to the prompt + output tokens exceeding the context window: since generation is a feedback loop, anything that disrupts the feedback (such as running out of token space) can cause the response to lose coherence and start generating rubbish.
I've tried testing this locally using images clipped from your screenshots and it works fine for me, but it's worth testing in your environment. Do the same CLI test as before, but set a large context window at the start:
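The concrete command was not captured in the mirror; in the interactive CLI a larger context window can be set with the /set parameter command, along these lines (image path hypothetical):

    ollama run llama3.2-vision
    >>> /set parameter num_ctx 16384
    >>> How many question marks are there in the image? /tmp/questions.png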
@MDev-eng commented on GitHub (Dec 26, 2024):
With a 16K context window the behavior is even worse: the server goes crazy not at the second question (the one about a picture) but already at the first, non-picture-related question.
@MDev-eng commented on GitHub (Dec 26, 2024):
There is one difference, though: the "ollama" process did not go to 100% CPU. With the previous setup (default context window), the gibberish G's were invariably accompanied by the ollama process at 100% CPU. With a 16K context window, that no longer seems to be the case.
@MDev-eng commented on GitHub (Dec 26, 2024):
The same happens with ctx size = 32768.
With ctx size = 65536, still no joy, but something interesting happens. First, there is about a 20-second delay before any output begins (normally the delay is only 3-4 seconds). During this time, only 2 MB of the model were loaded on the second GPU, and nothing on the first (normally, roughly half of the model gets loaded on each GPU). Then gibberish G's begin to appear, but very slowly; while they appear, the ollama process sits at 175% CPU, and by the time they stop appearing, about 11 GB have been loaded on EACH GPU. Normally the 11B model is split almost evenly between the two GPUs (for example, 5 GB on the first and 7 GB on the second), so I don't know where these 11+11=22 GB come from. And I wonder why this happens just because I set the context size to 64K.
Finally, with the 64K context, the ollama process is left at 100% CPU after the gibberish G's have stopped appearing (whereas with 32K, it went back to idle at the end of the test).
@MDev-eng commented on GitHub (Dec 26, 2024):
With 1K and 4K context sizes, the behavior is the same as with 16K or 32K.
@rick-github commented on GitHub (Dec 27, 2024):
Thanks for trying other context sizes.
Just another clarification: the CPU going to 100% during GPU inference is expected. The synchronization mechanism between the CPU and the GPU(s) is a busy wait. The processors use a bit of shared memory to communicate: when it comes time to perform an inference, the CPU sends commands to the GPU and then spins on the shared memory waiting for the GPU to say 'yes, finished that command, what next?'; the CPU sends the next command, goes back to spinning, and so on until the inference is complete.
I'm surprised that varying the context size causes immediate breakdown. Would it be possible for you to add the logs from the 16K and 64K experiments?
@MDev-eng commented on GitHub (Jan 6, 2025):
100% CPU usage for a polling loop seems a bit of a waste of computing power. Since results come back from the GPU only after a small (but not tiny) delay, it doesn't seem necessary to poll billions of times per second; perhaps 100 times a second would be more than enough. Maybe put a millisecond sleep somewhere in that loop?
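A toy Go sketch of the difference between the driver's spin-wait described above and the millisecond-sleep polling suggested here (illustrative only; the real synchronization lives inside the Nvidia/ROCm drivers, not in Ollama):

    package main

    import (
        "sync/atomic"
        "time"
    )

    var done atomic.Bool // stand-in for the shared-memory flag the GPU sets

    // spinWait burns a full CPU core until the flag flips; this is why a
    // busy-wait shows up as 100% CPU during GPU inference.
    func spinWait() {
        for !done.Load() {
        }
    }

    // sleepPoll checks the flag ~1000 times per second instead, trading a
    // little wake-up latency for a mostly idle core.
    func sleepPoll() {
        for !done.Load() {
            time.Sleep(time.Millisecond)
        }
    }

    func main() {
        go func() { time.Sleep(50 * time.Millisecond); done.Store(true) }()
        sleepPoll() // swap in spinWait() to see a core pinned at 100%
    }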
@rick-github commented on GitHub (Jan 7, 2025):
To clarify further, this is a feature of the Nvidia and HIP (ROCm) drivers. There are discussions about this on llama.cpp (e.g. https://github.com/ggerganov/llama.cpp/issues/8684), but it hasn't changed, so I assume there's a reason to keep this approach.
@gionkunz commented on GitHub (Feb 18, 2025):
I had the same issue: it printed "GGGGGG" after a few chats, and after that I only got "GGGGG" in new sessions too. I had overclocked my GPU (undervolting and +1200 MHz memory clock) and suspected that this could be the problem. While stable in other workloads, the AI workload seemed to destabilize the GPU in that overclocked state. As soon as I tuned it back to factory settings, the issue was gone.
@MDev-eng commented on GitHub (Feb 18, 2025):
Unfortunately in my case there is no overclocking to undo.
My setup uses two 12 GB RTX 3060s in absolutely stock condition.
So the quest for the root cause continues....
@arazdow commented on GitHub (Sep 25, 2025):
I'm getting the GGGGGGGG on a single GPU, an NVIDIA Jetson Orin Nano board (ARM-based). Not always, but often. Running llama3.1:8b.
@rick-github commented on GitHub (Sep 25, 2025):
https://github.com/ollama/ollama/issues/12209