Mirror of https://github.com/ollama/ollama.git (synced 2026-05-06 16:11:34 -05:00)
Closed · opened 2026-05-04 14:19:24 -05:00 by GiteaMirror · 27 comments
Originally created by @AlbertoSinigaglia on GitHub (Mar 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9890
What is the issue?
Just installed Gemma3, with context length 131072, and just learned that this means nothing to Ollama, since it still runs with a 2048 context size if not specified otherwise.
So, if I run it with the default context, it runs smoothly, loads the model correctly on a single GPU, outputs what it's supposed to in a matter of seconds, and has no problem at all answering.
However, as soon as I run /set parameter num_ctx 128000, it shards the model across GPUs and never answers again.
Context: I'm running it on a server with 3x A6000, using the following config in systemctl edit ollama.
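(The poster's actual override is not captured in this mirror; the sketch below is only a generic illustration of what a systemctl edit ollama drop-in looks like, with example variables rather than the configuration used here.)

```sh
# Hypothetical example, not the poster's actual config.
# `systemctl edit ollama` opens a drop-in override file, e.g.
# /etc/systemd/system/ollama.service.d/override.conf, with contents like:
#
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
#   Environment="OLLAMA_MAX_LOADED_MODELS=1"
#
# The service must be restarted for the new environment to take effect:
sudo systemctl restart ollama
```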
Relevant log output
OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: 0.6.0
@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):
To be fair, the model does answer after about 5 minutes, but it runs at 1 t/s, whereas initially it is around 10 t/s... I would understand it if the context were actually 128K, but since both versions work with almost no context in the input, it feels weird that a longer context causes such a slowdown.
@rick-github commented on GitHub (Mar 19, 2025):
This is tripling the size of the context that ollama allocates on the GPU. This causes a bunch of layers to be loaded into system RAM, and inference is much slower. Server logs will show how much VRAM the increased context size is consuming.
@rick-github commented on GitHub (Mar 19, 2025):
https://github.com/ollama/ollama/issues/9791#issuecomment-2727576383
@ALLMI78 commented on GitHub (Mar 19, 2025):
"the model after like 5 minutes answers, yet it works at 1t/s"
That's exactly what I was seeing in my setup with a 32k context... I feel like there are still issues with larger context sizes, but since I can't break down my entire workflow into a reproducible example, I'm having trouble presenting the problem. Maybe you can... I also wasn't sure if my GPU (4060 Ti 16GB) is simply too weak for Gemma 3.
@rick-github commented on GitHub (Mar 19, 2025):
More context == less room for model weights in VRAM. Less room for model weights in VRAM == model weights in system RAM. Model weights in system RAM == slower inference.
@ALLMI78 commented on GitHub (Mar 19, 2025):
Dear Rick,
You're clearly the pro here, and I don’t mean to challenge you, but if someone with 3x A6000 GPUs can’t get Gemma 3 running properly, is that really expected behavior?
Your explanation makes sense and is technically correct, but again—I can run 14B models on my GPU with a 32k context without any issues, yet I can't get Gemma 3 (12B) to work, experiencing exactly the symptoms described here.
I remember reading somewhere that Gemma 3 requires additional memory for image-related tasks, but I can’t find that source anymore. That could be an explanation…?
@AlbertoSinigaglia try 0.6.2
@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):
@rick-github, thanks for the clarification; I switched to Environment="OLLAMA_NUM_PARALLEL=1" to see if it makes a difference. I'm uploading a log file with everything saved (for reference, the server has 512 GB of RAM and 64 GB of swap).
Unfortunately, the same happens:
To be fair, the 5-minute wait time is only for the first generation after setting the context size to 128k, though the 1t/s speed still remains.
@ALLMI78 I'm actually running the 27B, yet the behavior is the same using Phi4 with 128k context (which is a 14B model)... I'll try with the newest ollama version
logs.txt
@ALLMI78 commented on GitHub (Mar 19, 2025):
Phi4 with 128k? Are you sure? Does it support that? I've only seen 16k versions until now...
To compare, you can also try qwen-14b; they run fine for me...
https://ollama.com/library/qwen2.5 (supports up to 128K tokens and has multilingual support)
@rick-github commented on GitHub (Mar 19, 2025):
A token has a different size for different models, so a 32k context for phi4:14b is 6.2G and for gemma3:12b is 12G. If a model needs to be sharded across several devices, the amount of VRAM required goes up because a copy of the graph needs to run on each device. gemma3 has additional memory requirements for the image projector, so gemma3 is going to consume more VRAM, pushing model layers into system RAM where inference is slower.
In this log, OLLAMA_NUM_PARALLEL is 3. This increases the context buffer that ollama allocates to 384000 tokens, which needs 182G. As a result, ollama can only load 12 of the 63 layers of the model in VRAM; the rest are loaded into system RAM, where the slower CPU does inference. If OLLAMA_NUM_PARALLEL=1, the context cache will fall to approximately 60G, leaving 120G for loading extra model layers in VRAM.
Changing the context size causes a model reload; that's why the first inference after changing num_ctx takes a while.
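(Side note, not from the original comment: an easy way to see how much of a loaded model ended up in VRAM versus system RAM is ollama ps.)

```sh
# Illustrative only: `ollama ps` reports where a loaded model's weights ended up.
# The PROCESSOR column shows the CPU/GPU split, e.g. "100% GPU" when everything fits
# in VRAM, or something like "41%/59% CPU/GPU" when layers spilled into system RAM.
ollama ps
```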
@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):
@ALLMI78 Sure, you are right, though I was not looking for a nice generation, only compute time requirements, and I had only that one as a "small" model.
@rick-github, you're right; I had mistakenly changed num models from 3 to 1 instead of num parallel (oops...).
Now the loading is pretty fast, though it allocates 20 GB on each GPU... which aligns with your predicted 60 GB of memory usage. However, it's still not quite clear to me why ollama should preallocate the whole 128k context instead of creating it dynamically based on the actual context (so growing from 0 to N over time)... Does it have anything to do with KV caching?
@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):
[OT] @ALLMI78 https://huggingface.co/microsoft/Phi-4-multimodal-instruct WELL, new version just dropped eheh
@ALLMI78 commented on GitHub (Mar 19, 2025):
@rick-github Dear Rick,
Thank you for the detailed explanations. Yes, you had already explained the OLLAMA_NUM_PARALLEL = 1/3 point, and I understood that.
I had already noticed that different models use different tokenizers. For example, I saw that Llama models require significantly fewer tokens than Qwen models (about 30-40% fewer). I understood all your explanations and think everything is absolutely correct—except for one point…
With a Gemma 3 12B Q4_K_M and 32K context, I am seeing 24 GB memory usage. The model itself is around 8 GB, so that would mean another 16 GB for the 32K context, including the image projector?
For Qwen 14B Q4_K_M with 8GB size, I get 16-17 GB memory usage with 32K context, so 8 for the model and 8 for the 32k context.
Because they told us "The current, most capable model that runs on a single GPU", and since there were initial RAM issues with Gemma 3, I remained a bit skeptical. I'm not able to run the 12b (or only at 1 t/s), and the 4b also doesn't work for me; it answers crap.
I trust you—if you say everything fits and is normal, I accept that. I just noticed that the symptoms here were exactly the same as for me before: too much memory usage → swapping to CPU → slow speed (~1 token/s), exactly like in my case.
@AlbertoSinigaglia
https://huggingface.co/models?sort=trending&search=phi-4+multimodal+gguf
@ALLMI78 commented on GitHub (Mar 19, 2025):
### OT -> preallocate
I think it won't work without preallocation, or at least it is hard to get a stable solution that way. You can't change the context size at runtime.
Imagine you start with 2K context and you later want to increase it, but in the meantime, the user has loaded another software that blocks VRAM, or some process is using it.
If you now need to dynamically increase memory usage because the context grows, it won’t work—suddenly, there’s no memory available.
A simple explanation—if something is incorrect, Rick can probably explain it better. ;)
@AlbertoSinigaglia commented on GitHub (Mar 19, 2025):
@ALLMI78 maybe they meant an H100 as "single GPU"...
Anyway, it still takes a lot, and also the Gemma models afaik are notorious for absurd tokenizers, because they work great on TPUs (not as much on GPUs), at least that's what I was told
On the GGUF, I'm still not quite at the point where I'm comfortable "making a GGUF version"; I'm still quite new to the world of LLMs. (As an OT, does it make such a large difference?)
About the RAG... sort of, it still takes a while for the first generation. Connecting to your last comment, I'm not sure the allocation is the only problem. Allocating 40 GB on the GPU takes just a few seconds, and loading a 40 GB model onto the GPU takes like 8-10 seconds, but loading a 20 GB model with a large context (overall 40 GB of memory required) takes waaay longer...
@ALLMI78 commented on GitHub (Mar 19, 2025):
Off-topic: The H100 costs around 38,000 euros here—I have no idea what Google expects. ;)
I’m already amazed how some people (like you) have three GPUs, when even one costs 5,000 euros.
Do you all buy them used, steal them, or am I just too dumb or too poor? ;)
@AlbertoSinigaglia commented on GitHub (Mar 20, 2025):
@ALLMI78 sooo I've re-downloaded all the models in their GGUF version. I'm not seeing a major difference in load speed, but there's a small speed bump on the generation side (though the Gemma3 model I was testing was already a GGUF version)... Regarding the GPUs, the answer is "I'm in academia".
@rick-github Any suggestion on how to reduce the slowdown due to the context size? Is it impossible to have a sort of "gradual" allocation of the context? For example, allocating 8k, and once the model has used the first 4k, allocating another 8k tokens, and so on?
@rick-github commented on GitHub (Mar 20, 2025):
It's easier. The positional encodings used in the context buffer are based on continuous sine/cosine functions, and the attention mask is computed on the fly, so there's no technical reason for constraining the window (other than the training size of the model), but it makes it easier to manage. For most applications, allocating the max size on the GPU is simpler than trying to manage memory and compete with other clients. The downside is that if you need more context than is available on the GPU, you get a performance hit. There are mechanisms to cope with this, like flash attention and sliding window optimization (#9892). gemma3 is a new model and has some unique architectural features, so there are some teething problems as it's integrated with the also-new architecture of the ollama runners, but the next few releases should see improvements.
context 12G, model graph 1.1G, projector 0.8G, projector graph 1G.
As Alberto says, what an enterprise considers a GPU is different to consumer hardware offerings.
That's not to say there aren't issues. 0.6.1 uses too much system RAM, which has been addressed in 0.6.2. There are still issues with large VRAM allocations happening during inference, crashing the runner (#9791). q4_0 and q8_0 cache quantization has a significant performance hit (#9683) for gemma3.
Flash attention and the sliding window PR should improve things.
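(For reference, and not part of the original comment: both of the knobs mentioned above are exposed as server environment variables. The values below are examples only, and as noted above, q4_0/q8_0 cache quantization currently carries a performance hit for gemma3.)

```sh
# Example settings only; put these in the same systemd drop-in (or shell environment)
# used to configure the ollama server, then restart it.
export OLLAMA_FLASH_ATTENTION=1     # enable flash attention
export OLLAMA_KV_CACHE_TYPE=q8_0    # quantize the KV cache (default is f16)
```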
The current architecture doesn't support that in ollama, but this is simple enough to manage in the client. When the client sends a request, it can set the context via num_ctx. The response will indicate how many tokens the prompt and the response took, and the client can adjust num_ctx if required. This will cause a model reload when the size crosses the maximum available on the GPU, but the model is already in the page cache, so the reload time will be short(ish).
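(A minimal sketch of this client-side approach, assuming the default local API endpoint; the model name and context size are placeholders, and the adjustment policy is up to the client.)

```sh
# Minimal sketch, not a complete client: send a request with an explicit num_ctx,
# then read the token counts from the response to decide whether the next request
# needs a larger context. Model name and sizes are placeholder values.
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Summarize the following document ...",
  "stream": false,
  "options": { "num_ctx": 8192 }
}' | jq '{prompt_eval_count, eval_count}'
# If prompt_eval_count + eval_count approaches num_ctx, resend with a larger num_ctx;
# crossing what fits in VRAM triggers the model reload described above.
```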
@ALLMI78 commented on GitHub (Mar 20, 2025):
Wow, thank you so much, Rick, for the detailed insights and the links! The point about "Sliding Window Optimization" is particularly interesting. I've often wondered whether some of the techniques used by Unsloth (they write something about 6x more context size and so on...) could also be applied in Ollama. I'm not deeply familiar with the topic, but maybe there's still some potential there? Perhaps they are already collaborating with you, or at least exchanging ideas—if that's not happening already?
"Context: 12 GB, Model Graph: 1.1 GB, Projector: 0.8 GB, Projector Graph: 1 GB."
Very interesting, thanks! I was just a bit confused because I couldn't get a "small" 12B model to run. I mean, if you try loading a "no-name" 12B model and it doesn’t work, that’s kind of expected. But since Gemma2 worked great for me—until the limited context size became an issue—I was really excited about Gemma3. And then came the disappointment. You spend hours trying different things, and nothing works. Of course, that’s part of the game, but I honestly didn’t expect it in this case.
The idea of managing this from the client side is interesting—thanks for the inspirations and for taking the time to explain everything! It was really insightful. :)
@ALLMI78 commented on GitHub (Mar 20, 2025):
Offtopic – Personal
Rick, thank you—and thanks to everyone else here—who contributes out of passion and a genuine desire to help. The topic of AI is far too important to be left solely in the hands of a few ultra-rich individuals or hidden behind paywalls. Every small explanation, every constructive discussion, every shared image, and every bit of information about how things work helps to open the door to this new world for non-academics like me. Keep up the great work! :)
@rick-github commented on GitHub (Mar 20, 2025):
LLMs are an evolving field, and still relatively young. There's lots of research going into improving training and inference, as those ideas and implementations mature they will be integrated into various code bases.
Yeah, as a small, capable model that does vision and (unofficially) tools, this is looking like a great foot soldier for the upcoming Agentic Wars. I think some of the problems actually come down to timing - a new model with novel architectural features and a revamp of the ollama runner architecture at the same time resulted in a less than stellar launch.
@AlbertoSinigaglia commented on GitHub (Mar 21, 2025):
I deeply share the gratitude of @ALLMI78, @rick-github thanks again for the explanation. I actually came here from this issue #11828, and only later realized that the problem was the context length, so I'll probably open a new issue on OpenWebUI to see if they can manage to implement that "simple" dynamic context length
I'll keep an eye on #9892, looks very promising
@AlbertoSinigaglia commented on GitHub (Mar 23, 2025):
Hi @rick-github, I've seen the pre-release with the sliding window... may I ask why that's available only for gemma3? For example, Llama3.3 Instruct also has a 130k context length. From a "web client perspective", I think it would be nice to have a request parameter for requesting fixed or dynamic allocation of the context, like what num_ctx does now for the context.
Side question: how do you recognize whether a model is a Gemma3 model? E.g., would that also work with an unsloth gemma3 model?
@rick-github commented on GitHub (Mar 23, 2025):
It's a feature of the modified architecture that gemma3 has. To quote from the technical report:
Architectural changes can't be backported to an existing model, but now that the Deepmind team have demonstrated the advantages of this approach, new models may adopt it. Probably not llama3.4, maybe llama4. Or they could come up with their own architectural tweaks.
It should work for all derivations of gemma3. I'm not familiar with what unsloth does to their model releases, so I can't say "will work".
@AlbertoSinigaglia commented on GitHub (Mar 24, 2025):
@rick-github I thought that the PR was aiming at building something like vLLM's Paged Attention, which (at first glance) looks like the "more sound" approach to this problem, instead of relying on Google (or whoever, in other cases), which might not care that much given their hardware availability.
Do you feel that Paged Attention might still be coming to Ollama?
@jessegross commented on GitHub (Mar 24, 2025):
Sliding window attention and paged attention are mostly orthogonal - they can be done separately or together.
Paged attention mostly helps with memory management in multi-request scenarios, whereas sliding window attention reduces the effective context size (memory/GPU cost) for each request. The latter is the one that is more helpful for the discussion here.
As Rick said, sliding window attention is an architectural feature of Gemma and not something that can just be turned on for other models. In fact, sliding window attention was implemented in Gemma from the first release, the upcoming release just has a more optimized implementation of it.
@AlbertoSinigaglia commented on GitHub (Mar 24, 2025):
@jessegross citing the original paper:
I'm not expert enough to be confident about it, but I'd throw out an educated guess and say that I don't see how this can be extended to the same-sequence/single-request scenario. At least from the paper, it seems to be just an efficient way to "see" a non-contiguous tensor as contiguous, thus allowing you to first allocate (for example) an 8k context vector and then increase it if the LLM is getting close to that limit, without having to reload the whole LLM or the tensor (instead, you just need to allocate a second 8k-long tensor to use as a context extension).
Though, if this is not the case, feel free to correct me.
@AlbertoSinigaglia commented on GitHub (Apr 3, 2025):
I came back to this closed issue just to say that the new Gemma3 runtime is amazing, and all of you maintaining this project are so amazing. Thank you so much!!!
(@rick-github)