Originally created by @VictorWangwz on GitHub (Aug 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11714
What is the issue?
The original model could run without problem, but the GGUF model fails to run with the errors below.
This may need an update of the ggml dependencies, as in llama.cpp: https://github.com/ggml-org/llama.cpp/pull/15091
Note: the GGUF runs on llama.cpp without problems.
Relevant log output
OS: No response
GPU: No response
CPU: No response
Ollama version: No response
@a-makarov-kaspi commented on GitHub (Aug 6, 2025):
Yep, ollama returns a 500 error.
Ollama 0.11.2 for macOS
@teodorgross commented on GitHub (Aug 6, 2025):
Yes, it's not working.
@ramkumarkb commented on GitHub (Aug 6, 2025):
I can also confirm the same error with the latest version, 0.11.3-rc0. I tried with the following GGUF models -
@maksir commented on GitHub (Aug 6, 2025):
Maybe this will help:
@teodorgross commented on GitHub (Aug 6, 2025):
@rbnhln commented on GitHub (Aug 6, 2025):
Got the same error:
Tried different models and sizes:
ollama version: 0.11.2
The gpt-oss_20b model is running.
@discostur commented on GitHub (Aug 6, 2025):
Same error here:
@kappa8219 commented on GitHub (Aug 6, 2025):
Some more:
Maybe it is the Unsloth folks' problem, since the "original" works fine. Or indeed some GGML lib problem.
@kappa8219 commented on GitHub (Aug 6, 2025):
Quite intriguing what the result of quantizing an already pre-quantized model would be :)
@snowarch commented on GitHub (Aug 6, 2025):
Yes.. Ollama always tends to screw up the GGUF models. I'm using the one from unsloth with llama.cpp and it works great.
@teodorgross commented on GitHub (Aug 6, 2025):
It's working well in other tools, except ollama.
@musica2016 commented on GitHub (Aug 7, 2025):
Same problem
Ollama version is 0.11.3-rc0
System: Ubuntu
@406747925 commented on GitHub (Aug 7, 2025):
same problem
ollama version is 0.11.3
CUDA Version: 12.4
Tesla V100
@niehao100 commented on GitHub (Aug 7, 2025):
same problem
ollama version is 0.11.3
ROCm version 1.15
RX 7700 XT
@billchurch commented on GitHub (Aug 7, 2025):
I'm experiencing the same issue with the unsloth gpt-oss-20b model on my AMD setup. Here are my system details:
Environment:
Model tested:
hf.co/unsloth/gpt-oss-20b-GGUF:F16
Error reproduced:
Relevant logs from journalctl:
The error is identical to what others are reporting: the tensor blk.0.ffn_down_exps.weight has an invalid ggml type 39 (NONE). This happens on an AMD ROCm setup as well, not just CUDA environments. Other models like gpt-oss:20b, llama3.2:1b, phi3:mini, and gemma3n:e4b work perfectly fine with 100% GPU offloading on this same setup.
@srd00 commented on GitHub (Aug 7, 2025):
Apparently the solution is to update llama.cpp: https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/11
@mabhay3420 commented on GitHub (Aug 7, 2025):
On Apple Silicon, the following version installed with brew works fine.
Thanks @srd00 for the suggestion to use a different version.
Previously I was getting the error:
@Kira-PH commented on GitHub (Aug 9, 2025):
I get this on mradermacher/gpt-oss-20b-uncensored-bf16-GGUF:Q8_0 and DavidAU/OpenAi-GPT-oss-20b-abliterated-uncensored-NEO-Imatrix-gguf:Q8_0
@billchurch commented on GitHub (Aug 9, 2025):
@HughPH There's definitely something missing from some ollama distributions; 0.11.3 produces the same errors for me for all gpt-oss variations except the original from OpenAI. I compiled llama.cpp and I'm not having these problems. Ollama 0.11.3 should support this, but it's not working on my Linux distros.
@Teravus commented on GitHub (Aug 10, 2025):
I have this issue too with the unsloth versions on Ollama.
llama_model_loader: - kv 42: quantize.imatrix.entries_count u32 = 193
llama_model_loader: - kv 43: quantize.imatrix.chunks_count u32 = 178
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q5_1: 1 tensors
llama_model_loader: - type q8_0: 169 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 20.55 GiB (8.44 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gpt-oss'
llama_model_load_from_file_impl: failed to load model
time=2025-08-09T17:20:47.030-07:00 level=INFO source=sched.go:453 msg="NewLlamaServer failed" model=
I don't have anything to add except, 'me too'.
@ggerganov commented on GitHub (Aug 10, 2025):
Since none of the maintainers here seem to care enough to explain the actual reason for ollama to not support the HF GGUF models, while the root cause is pretty obvious, I will help explain it:
Before the model was released, the ollama devs decided to fork the ggml inference engine in order to implement gpt-oss support (https://github.com/ollama/ollama/pull/11672). In the process, they did not coordinate the changes with the upstream maintainers of ggml. As a result, the ollama implementation is not only incompatible with the vast majority of gpt-oss GGUFs that everyone else uses, but is also significantly slower and unoptimized. On the bright side, they were able to announce day-1 support for gpt-oss and get featured in the major announcements on the release day.
Now, after the model has been released, the blogs and marketing posts have circled the internet and the dust has settled, it's time for ollama to throw out their ggml fork and copy the upstream implementation (https://github.com/ollama/ollama/pull/11823). For a few days, you will struggle and wonder why none of the GGUFs work, wasting your time trying to figure out what is going on, without any help or even with some wrong information. But none of this matters, because soon the upstream version of ggml will be merged and ollama will once again be fast and compatible.
Hope this helps.
@fitlemon commented on GitHub (Aug 10, 2025):
Absolute legend ;). Because of this problem I migrated from ollama to llama.cpp.
@ericcurtin commented on GitHub (Aug 11, 2025):
Note all this stuff is a one-liner with docker model runner:
docker model runner uses llama.cpp, is open source, and is open to contributions just like llama.cpp. Get involved where appropriate.
Another neat feature is that models are stored as OCI artifacts, so you can push them to any old OCI registry.
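For illustration, a minimal sketch of that one-liner flow, assuming the standard docker model subcommands; the model reference is just an example, not a specific published artifact:
# pull the model as an OCI artifact, then chat with it (model name illustrative)
docker model pull ai/gpt-oss
docker model run ai/gpt-oss "hello"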
@Teravus commented on GitHub (Aug 11, 2025):
While I think a lot of this is true, I did a speed comparison, and I don't agree with you about llama.cpp doing it faster.
I downloaded and experimented with the CUDA version of the latest llama.cpp (b6123) using:
Then loaded the model with
llama-server --port 9001 -hf ggml-org/gpt-oss-20b-GGUF
and then pointed open-webui's OpenAI-compatible connection at the v1 endpoint.
It worked with the ggml-org mxfp4 model.
Also, it returned about 5 tokens per second with a 4096 context.
Ollama's default context length is also 4096, and the official model from the ollama repo (assuming it is the mxfp4 version) runs at about 72 tokens per second on ollama.
There are two 3090s in this machine, 128GB of system memory, and an Intel 11900K, and while it did use a tiny amount of resources on one graphics card, it preferred using the CPU for some reason. It even logged that the GPUs exist:
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
prompt eval time = 894.66 ms / 24 tokens ( 37.28 ms per token, 26.83 tokens per second)
eval time = 93704.28 ms / 556 tokens ( 168.53 ms per token, 5.93 tokens per second)
total time = 94598.94 ms / 580 tokens
srv update_slots: all slots are idle
I appreciate the work that you've done. It's an absolute technical marvel.
And the reason people use things like ollama is that they make it easy to use. They're adding ease of use to the mix. It's a different target audience.
There might be an arcane incantation documented somewhere that will make it work better and more easily on this machine. And Ollama just loads it in a reasonable way without me having to think about it.
I also tried to run some of the unsloth versions of the models and the llama-server would just kind of freeze up between prompts.
Open-WebUi made the request, and some text showed up in the console, but it stopped generating text after the first prompt. Killing the llama-server process and re-running the startup command made the server come back up and it was able to respond one more time, before freezing again.
Just saying, that's my experience.
@ngxson commented on GitHub (Aug 11, 2025):
@Teravus We are actively working on the problems that you mentioned, just give us a bit of time.
Having both the best performance and good UX is not an easy task, given the community-driven nature of llama.cpp. Some of the llama.cpp maintainers even have to work during their vacations, just to have someone else copy their work without giving any credit.
@ericcurtin commented on GitHub (Aug 11, 2025):
Don't have Nvidia hardware, but I think there are two key flags you are missing; try toggling them to see if they help (flash attention and cache-reuse):
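For concreteness, a sketch of what toggling those two flags looks like (the model reference is reused from elsewhere in the thread, and the cache-reuse value is an assumption):
# flash attention plus prompt-cache reuse, as suggested above
llama-server -fa --cache-reuse 256 --jinja --port 9001 -hf ggml-org/gpt-oss-20b-GGUF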
@pwilkin commented on GitHub (Aug 11, 2025):
I mean, you didn't tell llama.cpp to use the GPU; can't really complain too much 😄
Try
llama-server -fa -ngl 99 --jinja --port 9001 -hf ggml-org/gpt-oss-20b-GGUF
and tell us how that benchmarks.
@ericcurtin commented on GitHub (Aug 11, 2025):
Ha yes, -ngl 999 is pretty crucial 😄
@ericcurtin commented on GitHub (Aug 11, 2025):
I know --cache-reuse 256 has been recommended by @ggerganov in the past; I don't have Nvidia hardware myself, so I don't know how significant it is.
@ITankForCAD commented on GitHub (Aug 11, 2025):
Better yet, use the dedicated benchmark binary provided by the fine folks at llama.cpp: llama-bench
@mudler commented on GitHub (Aug 11, 2025):
The worst part of all of this is that nothing will change in the long run. This has already been the case for a long time; just to cite another case, there are llama.cpp's multimodal capabilities. I have mixed feelings about ollama exactly for this reason: it would have been much better if all the projects that depend on @ggerganov's and the ggml team's work had upstreamed their contributions directly, so anyone in the ecosystem could benefit, avoiding vendor lock-in and duplicated effort everywhere.
For instance, in LocalAI - and likewise in LM Studio and Docker - you will find that everything "just works" because of working as a community, actually giving credit where it belongs (you guys really rock ;) ), consuming llama.cpp, reporting issues, and upstreaming any changes directly there.
It is quite frustrating to see that the open-source scene is really getting derailed lately by this kind of bad attitude.
@Teravus commented on GitHub (Aug 11, 2025):
The fact that this is an inside joke, and that people are laughing about it, is exactly the problem. This wasn't communicated effectively to me on first use. And it didn't decide to do this sensible thing itself, given the situation it was running in.
I think ngxson gets it. I agree that doing both the technical solution and having something that is easy to use is hard.
I see people saying that "X took my work without adding anything... it's just worse". I'm saying that they are adding something. They're working on the problem from an 'it's too hard to get it to work well for most people' angle. My 'first user experience with llama-server' is documented here. Had the program analyzed the hardware and the model and determined which layers to offload to the GPU, the post above probably wouldn't be there, because it would have been, at the very least, somewhat friendly and picked sensible defaults given the situation it is running in. It takes analysis and development effort to write something that inspects the model and the resources of the environment it runs under and makes sensible (maybe non-optimal, but sensible) decisions.
There's only so much time in a day. There are only so many developers working on this, and they have limited time. They work on this during their vacations. That's exactly the point. The target audience is different. If I wanted to spend time optimizing, reading the documentation, and experimenting with which layers are best offloaded to the GPU, or bothering the developers with lots of questions, then ollama might not be the best to use. However, Ollama isn't just marketing itself better; it's focusing on a different kind of user.
@ericcurtin commented on GitHub (Aug 11, 2025):
Just trying to help out @Teravus, not make fun of you! I missed that -ngl 999 was missing from that command line too; it's no big deal, I made the same mistake as you here, I was actually poking fun at myself 😄
One of the things docker model runner, LocalAI and LM Studio try to do is set the correct flags in llama-server under the hood, so users need fewer instructions. They are all users of upstream llama.cpp.
@ericcurtin commented on GitHub (Aug 11, 2025):
@ggerganov @ngxson could we document reasonable defaults somewhere in upstream llama.cpp as a short-term solution? Kinda like how q4_k_m is a reasonable default for GGUF? In my head this is what I have. Even if they are not perfect and don't fit every little use case, it's better than nothing. I volunteer to open a PR, but I need help with the info! I don't have enough experience with the various stacks and hardware to know exactly. One thing I do know from running CPU inferencing on an Ampere machine with a tonne of CPU cores is that "--threads (number of cores/2)" seems like a reasonable default (see the concrete sketch after the list below):
CPU:
llama-server --jinja --cache-reuse 256 --threads (number of cores/2) -hf some-model
CUDA:
llama-server --jinja -fa -ngl 999 --cache-reuse 256 --threads (number of cores/2) -hf some-model
METAL:
llama-server --jinja -fa -ngl 999 --cache-reuse 256 --threads (number of cores/2) -hf some-model
ROCM:
?
VULKAN:
?
OPENCL:
?
MUSA:
?
CANN:
?
BLAS:
?
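As a concrete instance of the "--threads (number of cores/2)" idea on Linux, assuming bash and coreutils (the model reference is a placeholder):
# half the logical core count, computed at launch time
llama-server --jinja --cache-reuse 256 --threads $(( $(nproc) / 2 )) -hf some-model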
@pwilkin commented on GitHub (Aug 11, 2025):
@Teravus I mean - yeah, you're right. It's very often a case of first UX impressions. I have the same problem with vLLM you have with llama.cpp - it's extremely annoying to work with because getting to the sane options for my setup takes a lot of time.
For what it's worth, the llama.cpp guys have already mentioned having a "try reasonable defaults for user setup" default launch - but of course that's easier said than implemented. And yes, having more informative error messages (i.e. instead of "Error X" having "Error X, you might want to try this and this") would probably help as well.
But this thread is about something entirely different. If you build upon a technology, in the OSS world it's a good habit to actually contribute back to the technology you use when you build something new. Another good habit is to work together if your solution would add some new feature to the code, and only fork if it's absolutely obvious that you're not going to be able to work together (that does happen, for entirely valid reasons: the original code owners might have another vision of how to do various changes; or they have another vision of what should be added first; or they are doing their own refactor and don't want your code to mess things up). But Ollama has, again and again, done the opposite of that: made hacky solutions of their own on top of existing llama.cpp / ggml code instead of contributing to the baseline, then taken the fixes that the ggml team has done as "new features" or "bugfixes" of their own platform. That's the thing @ggerganov is pointing out here.
@Teravus commented on GitHub (Aug 11, 2025):
Ollama, for sure, needs to provide something to the user that says they're using code from llama.cpp. Usually this is in an about box. I don't even see an about box, so it doesn't look like they're complying with that. I only know that ollama uses llama.cpp under the hood from having issues with some models and needing to manually prepare them with llama.cpp in order for them to work under ollama. It was at that point that I learned that the underlying technology, the state-of-the-art one that the 'good' implementations relied on, was llama.cpp. The only mention of llama.cpp that I see isn't really a 'we use this software' reference. It's just:
Supported backends
llama.cpp project founded by Georgi Gerganov.
This doesn't seem like enough. It skirts the issue by treating llama.cpp like a back-end.
A deeper look down the rabbit hole: it looks like Ollama documents the "patches" that they make to llama.cpp here:
https://github.com/ollama/ollama/tree/main/llama/patches
The last batch of them has been about gpt-oss.
It also looks like there may be some NDAs involved.
From a comment: "this is exactly how they prepped the 0.11.0 release w/o breaking OpenAI NDAs, it made a cuda crash refer to a nonsense line number."
I'm not sure how that affected the situation with gpt-oss.
@Teravus commented on GitHub (Aug 11, 2025):
I think you've been very reasonable.
I know that 'thumbs down' doesn't really mean anything in terms of effects on accounts or anything.
As a side note, I still think it's funny how many people 'thumbs-downed' the documentation of a first-user experience.
People can 'thumbs down' anything they want. It doesn't affect anything from a comment or account perspective. It doesn't mark it as spam. It just means they don't like it.
I wonder if they realize that they're actually thumbs-downing the result of the first-user experience. By the transitive property, they're thumbs-downing the first-user experience itself.
@ericcurtin commented on GitHub (Aug 11, 2025):
The problem with reactions is that sometimes people are reacting to a portion of the post. It's being given a thumbs down because this was a comparison between GPU-accelerated Ollama and CPU-based llama.cpp. GPU will beat CPU, of course.
If you do GPU vs GPU, llama.cpp wins.
I think more documentation is a good idea.
@Teravus commented on GitHub (Aug 11, 2025):
Yep, after using:
llama-server -fa -ngl 99 --jinja --port 9001 -hf ggml-org/gpt-oss-20b-GGUF
prompt eval time = 494.48 ms / 1651 tokens ( 0.30 ms per token, 3338.89 tokens per second)
eval time = 3214.37 ms / 386 tokens ( 8.33 ms per token, 120.09 tokens per second)
total time = 3708.84 ms / 2037 tokens
120 t/s llama.cpp, 72 t/s Ollama
llama.cpp wins.
I'm going to leave the old post, unedited. I'm curious how many thumbs-downs it will get.
@JohannesGaessler commented on GitHub (Aug 12, 2025):
By necessity, downstream projects like ollama use rough estimates to choose the number of GPU layers. And to avoid OOMing these estimates need to be chosen very conservatively, leaving a lot of performance on the table. It is my opinion that in such cases it's better not to set the number of GPU layers automatically since otherwise a lot of users would be unknowingly using bad defaults. One of my current projects is to implement this automation properly in llama.cpp, maybe I'll bump it up in terms of priority.
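As a rough sketch of what such a conservative estimate looks like (all numbers below are assumptions for illustration, not ollama's actual heuristic):
# back-of-the-envelope layer-offload estimate in shell
free_vram_mib=23306                                   # e.g. the CUDA0 figure logged earlier in this thread
model_size_mib=21000                                  # roughly the 20.55 GiB GGUF from the log above
n_layers=24                                           # assumed transformer block count
per_layer_mib=$(( model_size_mib / n_layers ))
ngl=$(( free_vram_mib * 80 / 100 / per_layer_mib ))   # leave ~20% headroom for KV cache and buffers
echo "try: llama-server -ngl ${ngl} ..."              # prints roughly -ngl 21 with these numbers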
@Teravus commented on GitHub (Aug 12, 2025):
I understand the perspective.
I guess it boils down to: "Who is the project serving?"
Who is the target audience?
llama.cpp is targeted at a different, more niche, user than ollama.
That, in and of itself, isn't a bad thing, as long as llama.cpp understands that many more users will flock to ollama first and ollama will be many people's first introduction to LLMs, because it's easier and less intimidating than llama.cpp. It will be more popular, and more people will use it as a result. The more niche users will continue to use llama.cpp directly. There may be some converts from ollama to llama.cpp who try ollama and want to do more, but they'll be the same kind of niche users that llama.cpp attracts with its focus.
llama.cpp
ollama
If llama.cpp doesn't want that kind of user, then they're doing OK already.
What a lot of people in this thread don't want to hear is that, if llama.cpp wants to be 'more popular' or be users' first introduction to running LLMs locally, it has to serve that second type of user as long as something like ollama exists.
Easier = more users.
More detail = fewer, but more invested, users.
A portion of the community that uses llama.cpp (not the core developers) suggests that all of the innovation comes from llama.cpp and ollama is just a thin wrapper around llama.cpp. If that's the case, then it should be simple to 'just make it easier': 'just make a wrapper that does it'.
Again, this isn't to say that Ollama is doing well. They're not, in a few ways, but one specific issue is that they're clearly using code from gguf and llama.cpp without correct MIT license attribution. The only mention of llama.cpp is in the readme.md as 'the one and only backend', and that's not enough to satisfy the license. At the very least, it needs an about box in user space that includes attribution to the software that they use.
@Kira-PH commented on GitHub (Aug 13, 2025):
There are people in the middle of those extremes. I was the AI/ML domain architect at IFS. I've been coding professionally for some 25 years. I read the chat templates and code against the /generate endpoint, because I've been messing with LLMs since before the GPT3 closed beta, and I feel happier getting that bit closer to the model. But fundamentally I just want to offload the effort when I'm tinkering at home. If I was in a professional environment then I'd dig into llama-server and get to know it intimately, but for idly messing about I just want to load a model and go. This is like an F1 mechanic. When he's not at work, he just wants to get in his Audi and drive somewhere, not spend an hour preheating his tyres, tinkering with the engine and reading telemetry data.
@mkultra333 commented on GitHub (Aug 14, 2025):
So I've wasted hours and hours over two nights slowly downloading and re-downloading 15GB models from HF, only to find out the reason they aren't working is because Ollama is using botched, rushed code and didn't warn anyone that it was broken for the gpt oss models apart from their own?
Thanks Ollama.
Their implementation ran so damn slow anyway; getting a response from their own 20B model on my 5060 Ti is like watching paint dry, as all the work seems to be happening on the CPU instead of the GPU even though there's VRAM to spare.
Ugh. I'm trying LM Studio to see if that's any better. Pity I also now have to re-download the models AGAIN because for some reason Ollama has to put all the GGUFs in its own weird blob format. This is so tedious.
@kappa8219 commented on GitHub (Aug 15, 2025):
With 0.11.5-rc2 things changed: GGUFs are running. Still, what I got is that the "original" gpt-oss-20b is almost twice as fast as the Unsloth GGUF one, with comparable sizes (13G and 11G), both 100% on GPU.
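For reference, a hedged example of what that looks like on a fixed build, reusing the hf.co reference mentioned earlier in the thread:
# pull and run the Unsloth GGUF directly through ollama
ollama pull hf.co/unsloth/gpt-oss-20b-GGUF:F16
ollama run hf.co/unsloth/gpt-oss-20b-GGUF:F16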
@expnn commented on GitHub (Sep 4, 2025):
Ollama 0.11.8 runs successfully at first, but it crashes after generating a small amount of text. See my example here: https://github.com/ollama/ollama/issues/10993#issuecomment-3248383362
@shimmyshimmer commented on GitHub (Sep 5, 2025):
This is because we preset our GGUFs in Ollama with top_k = 0, which slows down the GGUFs a lot. In our testing, when we remove the top_k setting, it scores the same results as Ollama.
In the future we will be changing the pre-set top_k = 0 to maybe 64 or 128 instead.
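Until then, a hedged sketch of overriding that preset locally (the model reference and top_k value are illustrative):
# re-create the model with an explicit top_k instead of the preset 0
ollama pull hf.co/unsloth/gpt-oss-20b-GGUF:F16
printf 'FROM hf.co/unsloth/gpt-oss-20b-GGUF:F16\nPARAMETER top_k 64\n' > Modelfile
ollama create gpt-oss-20b-topk64 -f Modelfile
ollama run gpt-oss-20b-topk64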
@JohannesGaessler commented on GitHub (Sep 6, 2025):
I don't know what ollama uses for sampling, but in llama.cpp the issue with top-k 0 was that the fast custom bucket sort was only implemented for top-k, so disabling top-k resulted in a fallback to the slower std::sort for the whole token array. The implementation was generalized in https://github.com/ggml-org/llama.cpp/pull/15665, plus an optimization that first tries sorting only the top 128 tokens (which should be enough for most cases).
@OracleToes commented on GitHub (Sep 27, 2025):
I'm still getting this problem, and it seems from the conversation in this issue that we know how to fix it, so why is it that a month later we still can't run GGUFs of gpt-oss models?
It's worth noting that the GGUF models work in the playground, but not in the regular chat interface.