Mirror of https://github.com/ollama/ollama.git (synced 2026-05-06 08:02:14 -05:00)
Closed · opened 2026-04-12 17:30:48 -05:00 by GiteaMirror · 92 comments
Originally created by @Abdulrahman392011 on GitHub (Mar 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9457
What is the issue?
When I load the granite vision model, which is 2.5 GB, RAM usage and the ollama ps command show an 11 GB model running.
Also, when I run the fp16 version of granite vision (6 GB), it shows 15 GB in RAM and in the ollama ps command.
Relevant log output
OS
No response
GPU
No response
CPU
No response
Ollama version
No response
@rick-github commented on GitHub (Mar 2, 2025):
2.5G (or 6G) is just the size of the model weights. ollama also needs memory for context buffer, model graph, etc. The larger the context buffer, the more memory is required. The default context size of
16384 tokens results in a memory footprint of 5.2G so if you have 11G, you have likely used a larger context size.
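As a rough illustration of how the context buffer scales (a sketch with made-up layer/head sizes, not values taken from this model's logs):
# KV cache bytes ≈ 2 (K and V) * n_layers * n_kv_heads * head_dim * context_tokens * 2 bytes (fp16)
echo $(( 2 * 40 * 8 * 128 * 16384 * 2 ))   # ~2.7 GB for a 16384-token context
echo $(( 2 * 40 * 8 * 128 * 65536 * 2 ))   # ~10.7 GB if the total context grows to 65536 tokens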
@Abdulrahman392011 commented on GitHub (Mar 2, 2025):
Is there a way that I can check and see what the context size is?
I didn't tamper with any of the settings on purpose, but it's worth checking.
@Abdulrahman392011 commented on GitHub (Mar 2, 2025):
NAME ID SIZE PROCESSOR UNTIL
granite3.2-vision:2b-q4_K_M 3be41a661804 7.0 GB 100% CPU 4 minutes from now
@Abdulrahman392011 commented on GitHub (Mar 2, 2025):
NAME ID SIZE PROCESSOR UNTIL
granite3.2-vision:2b 3be41a661804 11 GB 100% CPU 4 minutes from now
@Abdulrahman392011 commented on GitHub (Mar 2, 2025):
Mind you, those two are the same model and the same quantization.
@Abdulrahman392011 commented on GitHub (Mar 2, 2025):
NAME ID SIZE PROCESSOR UNTIL
granite3.2-vision:2b-fp16 17ca6aa97bd9 15 GB 100% CPU 4 minutes from now
@rick-github commented on GitHub (Mar 3, 2025):
Server logs will show the size of the context that the runner is started with: look for --ctx-size and divide by --parallel. If you haven't modified settings, then your clients are passing num_ctx in their API calls. Since granite3.2-vision:2b-q4_K_M was loaded with two different sizes, two of the clients are setting num_ctx to different values.
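For example (assuming a systemd install so the server logs go to the journal; the num_ctx value below is only an illustration):
journalctl -u ollama --no-pager | grep -o 'ctx-size [0-9]*' | tail -1
journalctl -u ollama --no-pager | grep -o 'parallel [0-9]*' | tail -1
# how a client would override the context size for a single request:
curl http://localhost:11434/api/generate -d '{"model": "granite3.2-vision:2b-q4_K_M", "prompt": "hello", "options": {"num_ctx": 8192}}'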
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
server_log_ollama.txt
I've tried checking but I can't figure it out.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
I am not sure if this is related, but I had a problem with the moondream model: it returned a weird error. On the other hand, LLaVA works fine, so it's not an issue with all vision models.
It's also worth mentioning that I am running on CPU; the GPU in the laptop is an old Nvidia card that isn't supported by ollama.
There are also only 8 GB of RAM, and I have about 50 GB of swap memory on disk.
Another thing is that I haven't confirmed that the granite-vision models actually work: once they spill into swap, everything becomes too slow, and I lose patience and stop them.
Note: the swap space is there for building apps from source; without it the device freezes and crashes. I don't actually use it for running LLMs (too slow).
@rick-github commented on GitHub (Mar 3, 2025):
ctx-size is 65536 and parallel is 4, so the context size is 16384. This is the default, so the clients aren't overriding num_ctx. What's causing the large VRAM footprint is the allocation of extra buffers for parallel completions, controlled by OLLAMA_NUM_PARALLEL. This is unset, so ollama is using the default value of 4. You can reduce the VRAM footprint by setting OLLAMA_NUM_PARALLEL=1 in the server environment.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
That's with sudo nano /etc/systemd/system/ollama.service,
then under the [Service] section write:
Environment="OLLAMA_NUM_PARALLEL=1"
@rick-github commented on GitHub (Mar 3, 2025):
The recommended way is:
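(The command block from the original comment isn't preserved here; for a systemd install the override flow generally looks like this:)
sudo systemctl edit ollama.service
# in the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
sudo systemctl daemon-reload
sudo systemctl restart ollama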
This will create an overrides file, and if ollama is upgraded the changes will be preserved. If you edit the service file directly, changes will be lost on the next upgrade.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
it worked after reboot.
NAME ID SIZE PROCESSOR UNTIL
granite3.2-vision:2b-q4_K_M 3be41a661804 4.7 GB 100% CPU 4 minutes from now
But this will affect all the other models. Is there a way that I can make it specific to this model?
Another thing: it's still too big. LLaVA 7B at the same quantization level is about 6 GB in RAM, and faster. I don't know why, but there is something we're missing here.
Just for comparison's sake, I gave it an image and it still hasn't produced output. It's been about 20 minutes and still nothing. LLaVA takes less than 5 minutes.
So, in conclusion, it's not working even after fixing the RAM issue; there is something else we need to fix.
It's not really worth the time and effort, but I am willing to stay on this out of curiosity if you are.
Any ideas as to what is causing this? The CPU is running at full blast, but there is a repeating pattern that I see in the system monitor, and a slight up and down in RAM corresponding to the CPU pattern changes. Almost as if it's failing to load something, but not quite.
@rick-github commented on GitHub (Mar 3, 2025):
It's possible the model has lost cohesion and is just outputting a stream of tokens without hitting an end-of-sequence token. You can make it exit this state by limiting the number of tokens it can generate with num_predict.
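For example (a sketch; the 128-token limit is an arbitrary illustration), from the interactive CLI:
ollama run granite3.2-vision:2b-q4_K_M
# inside the session, before sending the prompt:
/set parameter num_predict 128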
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
Hang on man, I am trying to run the same file on LLaVA, to make sure the issue isn't in the ollama version update and also that the issue isn't in the picture I am using.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
Alright, so I changed the picture to PNG, as that is what I used in the past, tried llava, and timed the response, but I am up to 24 minutes now and nothing.
I will try reverting to the older version of ollama and run the same thing again. Hopefully it will work again, and we will have narrowed it down to the version update; then you guys can check what changed for granite support that could also affect LLaVA.
@rick-github commented on GitHub (Mar 3, 2025):
I did a quick check using 0.5.13-rc4 and didn't see any problems:
@rick-github commented on GitHub (Mar 3, 2025):
0.5.13-rc2 also no problems using the above script.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
So I installed the stable version, and I think I understand what I did wrong.
When I was trying to install the preview version, I made a mistake: I manually downloaded the version I needed and ran "sudo tar -C /usr -xzf ollama-linux-amd64.tgz", which updated the client only, not the server itself. Then I ran "curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.5.13-rc2 sh", which updated the server, but I think I broke something in the process. Now that I have downloaded the stable version, ollama --version tells me the version is 0.5.12 but the client is 0.5.13-rc2, so I am uninstalling ollama completely and installing again.
I'll keep you updated
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
LLaVA 7B works; in 4 minutes it gave a response.
By the way, the client mismatch isn't relevant. I changed only the server version to 0.5.12 while the client is 0.5.13-rc2, and LLaVA works fine despite the client being newer. This also means that the change that caused the issue is in the server version, not the client.
I have to emphasize that I am running on CPU, and that is probably the reason why you are not experiencing the issue, since you're running on GPU.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
A good strategy here is for me to update again and try to get llava to work, since we know what has changed for it in the code, and compare it to the stable version.
Again, I have to say it may not be worth your time and energy; my laptop is old, and it could be that there is something wrong with my laptop rather than ollama. But I am willing to stay at it until we get it to work, as long as you're interested.
I will update again and check the cohesion thing you mentioned earlier, but for now I have to take my father to the doctor and I'll be back in a couple of hours.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
I am back; my brother took my father to the doctor. So I tried to install using the same command as earlier, "curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.5.13-rc2 sh", but it won't install. Have you guys rolled it back to work on it?
@rick-github commented on GitHub (Mar 3, 2025):
Only the most recent rc is made available, that's currently 0.5.13-rc5.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
I am downloading it and we'll see this through.
If the issue is in LLaVA as well as granite-vision, it means the issue is not model specific and it's about the way ollama runs vision models on CPU.
Does that fit with the model cohesion theory that we are investigating?
Hopefully the rc5 version will already have solved the issue.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
Nope, I downloaded the rc5 version, and it's the same problem: up to 13 minutes for LLaVA and it won't output.
Can you run it on CPU on your machine to rule that out? It's probably the CPU thing that goes under the radar; developers usually have strong machines for LLMs, which means they always use the GPU.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
I just saw that script, I didn't notice it before. Did you run it on GPU or CPU?
@rick-github commented on GitHub (Mar 3, 2025):
GPU for the above runs. Re-ran the test, CPU only, with 0.5.13-rc2, 0.5.13-rc4, 0.5.13-rc5. No issues other than longer processing time.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
I tried limiting num_predict, but nothing; same thing, it loads forever.
Maybe I didn't do it right.
import base64
import requests

models = ["llava:7b"]
image_path = "/home/abdelrahman/Pictures/Screenshots/p.png"  # provide the full path to the image
api_url = "http://localhost:11434/api/chat"

try:
    with open(image_path, "rb") as image_file:
        image_data = image_file.read()
        base64_image = base64.b64encode(image_data).decode("utf-8")  # encode to base64 string
except FileNotFoundError:
    print(f"Error: {image_path} not found. Please provide the correct path.")
    exit(1)

for model in models:
    print(f"{model}: ", end="", flush=True)  # print model name without newline
    # The request body was cut off in the original comment; this is a plausible
    # reconstruction, with the num_predict limit added as an option.
    response = requests.post(api_url, json={
        "model": model,
        "messages": [{"role": "user", "content": "Describe this image.", "images": [base64_image]}],
        "stream": False,
        "options": {"num_predict": 128},
    })
    print(response.json()["message"]["content"])
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
Is there a way I can list the option values to confirm that it received the change?
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
In the script above, I used Gemini to convert the code from Bash to Python and then added the option change manually.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
In the past I had an issue with moondream running on ollama, and I think this is an extension of that.
moondream didn't load forever, but it gave an error that I can't remember now.
Ironically, when I pip install moondream and use the code they provide on their website, it runs normally. So it is definitely something related to ollama and how the model is handled.
Also, if it runs on your machine, then it probably has something to do with my old laptop and the older Ubuntu kernels that correspond to an older machine.
I am talking to you from a 2015 laptop. It's not ancient, but still a decade old.
@rick-github commented on GitHub (Mar 3, 2025):
This is correct.
I went back and had a look at your log.
Your CPU has no vector extensions. It's not that the model is not generating output, it's just that your CPU is not suited for the matrix operations that LLM inference uses.
Try this instead of the script:
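(The exact command isn't preserved here; a streaming run from the CLI that passes the image path in the prompt would look something like this, with the prompt wording being a guess:)
ollama run llava:7b "Describe this image. /home/abdelrahman/Pictures/Screenshots/p.png"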
There will be a pause (perhaps several minutes) as the image is processed, then the model will start to generate tokens. Since this is running in streaming mode, you will see the tokens as they are generated, rather than waiting for the complete output as with the script.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
I ran the command, but I still don't understand why this particular version of ollama isn't running llava while the previous versions run it. Maybe the inference method has been changed to suit the new models, on the premise that it should work for the old models as well?
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
11 minutes and nothing. In the 0.5.12 version it produced output after about 4 minutes. I will leave it until it hits 20 minutes.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
I will redownload the 0.5.12 version, run the command, and give you the server log again.
@rick-github commented on GitHub (Mar 3, 2025):
This version of ollama does run llava:7b. What could be happening is that the right CPU backend is not being selected. That would explain why your log shows no vector extensions and runs much slower than 0.5.12. If you roll back to 0.5.12 and examine the logs after running the model, what does the line with msg="system info" show?
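(Assuming the systemd service, that line can be pulled out of the journal like this:)
journalctl -u ollama --no-pager | grep "system info" | tail -1
# it lists CPU features such as AVX and AVX2, which shows whether a vector-capable backend was loaded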
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
server_log_ollama_2.txt
@rick-github commented on GitHub (Mar 3, 2025):
0.5.12:
0.5.13-rc5
Looks like a build issue. The same CPU backend has been selected in both cases but 0.5.13-rc5 has no vector extensions. https://github.com/ollama/ollama/pull/9425 was merged a couple of days ago for a similar issue. How sure are you that you have completely deleted the previous ollama versions? What's the output of:
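(The command block isn't preserved here; judging from the reply below, it was along the lines of:)
ls -l $(dirname $(dirname $(command -v ollama)))/lib/ollama
ls -l /usr/local/lib/ollama/libggml-cpu-haswell.so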
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
abdelrahman@box:~$ ls -l $(dirname $(dirname $(command -v ollama)))/lib/ollama
total 3092
drwxr-xr-x 2 root root 4096 Feb 28 19:35 cuda_v11
drwxr-xr-x 2 root root 4096 Feb 28 19:35 cuda_v12
-rwxr-xr-x 1 root root 587424 Feb 28 19:19 libggml-base.so
-rwxr-xr-x 1 root root 470984 Feb 28 19:19 libggml-cpu-alderlake.so
-rwxr-xr-x 1 root root 470984 Feb 28 19:19 libggml-cpu-haswell.so
-rwxr-xr-x 1 root root 573384 Feb 28 19:19 libggml-cpu-icelake.so
-rwxr-xr-x 1 root root 479176 Feb 28 19:19 libggml-cpu-sandybridge.so
-rwxr-xr-x 1 root root 573384 Feb 28 19:19 libggml-cpu-skylakex.so
####################################################################
abdelrahman@box:~$ ls -l /usr/local/lib/ollama/libggml-cpu-haswell.so
-rwxr-xr-x 1 root root 466768 Feb 23 22:20 /usr/local/lib/ollama/libggml-cpu-haswell.so
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
but this is with the 0.5.12 installed and the 0.5.13-rc2 client installed
abdelrahman@box:~$ ollama --version
ollama version is 0.5.12
Warning: client version is 0.5.13-rc2
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
I will try to uninstall and install again.
I did that before, but I was afraid it would remove all the models, so I didn't follow the instructions precisely. I will try again, following all the instructions except the ones under removing models in the main ollama documentation:
https://github.com/ollama/ollama/blob/main/docs/linux.md
@rick-github commented on GitHub (Mar 3, 2025):
What's the output of
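(Again the command block isn't preserved; reconstructed from the reply below:)
type ollama
command -v ollama
which ollama
ls -l $(command -v ollama)
ollama --version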
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
Hang on, I already uninstalled it and am installing the new version 0.5.13-rc6 again. I ran the following commands to uninstall:
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo rm $(which ollama)
sudo rm -rf /usr/local/lib/ollama
i didn't do the other commands:
sudo rm -r /usr/share/ollama
sudo userdel ollama
sudo groupdel ollama
install is at 64% now
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
abdelrahman@box:~$ type ollama
ollama is /usr/local/bin/ollama
abdelrahman@box:~$ command -v ollama
/usr/local/bin/ollama
abdelrahman@box:~$ which ollama
/usr/local/bin/ollama
abdelrahman@box:~$ ls -l $(command -v ollama)
-rwxr-xr-x 1 root root 31575552 Mar 3 04:27 /usr/local/bin/ollama
abdelrahman@box:~$ ollama --version
ollama version is 0.5.13-rc6
I also started another test after installing the new 0.5.13-rc6 , I will tell you the results.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
guess what, it worked
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
sorry for wasting all that time and effort.
so in conclusion, all I really needed to do is uninstall ollama and then install it again using the command lines above.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
thanks for your help and sorry again.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
Hey, before you go: I tested it with llava like I told you above, but now I am testing granite-vision and it was back to 11 GB, because I had changed it with
sudo nano /etc/systemd/system/ollama.service
then under the [Service] section write:
Environment="OLLAMA_NUM_PARALLEL=1"
and that change reverted after the update, like you said. But I redid the above steps, so now it doesn't take 11 GB.
It's been about 8 minutes now and no output from granite-vision.
ollama ps shows the model taking 4.7 GB of RAM.
@rick-github commented on GitHub (Mar 3, 2025):
granite3.2-vision is much more verbose, it will take longer to generate a response.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
Yeah, you are right, it just started outputting. It took about 15 minutes.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
But isn't it weird that a 2-billion-parameter model takes just as much RAM as a 7-billion-parameter model, and also 3 times as long to produce output?
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
I guess that is an issue with the model not with ollama, right?
@rick-github commented on GitHub (Mar 3, 2025):
The default context window for most models is 2048 tokens. The default for granite-3.2-vision is 16384 tokens, so it needs 8 times more VRAM for the context buffer than most other models. It was probably configured this way precisely because it is more verbose than other models.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
thanks man, everything is running the way it should.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
Hey, more good news: the uninstall and reinstall also fixed the moondream model; it's running on ollama now. So if someone else is complaining about errors, uninstalling and reinstalling will fix it.
@Abdulrahman392011 commented on GitHub (Mar 3, 2025):
@Abdulrahman392011 commented on GitHub (Mar 4, 2025):
Hey, I have been reading a bit online about ollama's parallel number, and I was wondering: why does ollama reserve the memory from the start instead of waiting until there is another parallel request, then increasing the parallel number to two, and after the request is done returning it back to one (for example)?
I am no expert, but it seems like the logical thing to do. It shouldn't be all that hard to implement, and it would actually decrease the memory footprint of ollama, allowing the system to use that memory for something else while ollama is running in the background.
Kind of like how microstat is to temperature.
@rick-github commented on GitHub (Mar 4, 2025):
If there's not enough VRAM (because the current completion instance has allocated temporary VRAM) the runner will crash.
@Abdulrahman392011 commented on GitHub (Mar 4, 2025):
Clarify a bit more; I don't get it.
You're saying that it's risky to set the parallel number to 1 when loading the model, and that if another request then comes in, it will simply crash instead of increasing the parallel number to 2.
If so, then why doesn't ollama add some sort of handler that takes the request and, if another request arrives, instead of handing it directly to the runner and crashing it, waits until the runner is done with the request it has and then modifies the runner, even if that means loading the model again?
Then a handler version 2 would be able to modify the runner configuration without reloading the model from scratch.
@Abdulrahman392011 commented on GitHub (Mar 4, 2025):
pardon my English it is not my first language
@Abdulrahman392011 commented on GitHub (Mar 4, 2025):
Another thing to consider here is that most people who use ollama are using it locally for personal use, which usually translates to one request at a time. So for most use cases, starting with a max parallel number of one covers what most users need about 90% of the time, and for the other 10% there would be no harm in reloading the model for that use case and then reverting back to normal.
And again, eventually a way could be found to reconfigure the runner without reloading the model.
@rick-github commented on GitHub (Mar 4, 2025):
The ollama request handler does request queuing when the runners are busy. But what's the advantage of dynamically allocating context buffers over just allocating when the runner starts? It adds unnecessary overhead. The runners have to start dealing with memory fragmentation, over-commit, OOM scenarios, competition from other VRAM users, etc.
@Abdulrahman392011 commented on GitHub (Mar 4, 2025):
I put your comment into Gemini and it favors my opinion. Take a look:
Dynamic allocation of context buffers in request handlers, like in Ollama, offers several advantages over allocating them statically when the runner starts, despite the overhead you mentioned. Let's explore these benefits:
Resource Optimization and Efficiency:
Addressing Memory Fragmentation and Over-Commit:
While you are right to point out the challenges of memory fragmentation, over-commit, and OOM scenarios with dynamic allocation, these issues are often managed through memory management techniques and are considered acceptable trade-offs for the benefits of resource optimization and scalability.
Why Dynamic Allocation Might Be Preferred Despite Overhead:
The "unnecessary overhead" you mentioned in dynamic allocation (allocation/deallocation operations) is generally considered to be less significant than the overhead and limitations associated with static allocation in scenarios where resource efficiency, scalability, and flexibility are paramount.
In summary, while dynamic context buffer allocation introduces complexities related to memory management, it is often the preferred approach in request handlers like Ollama because it provides significant advantages in terms of resource optimization, scalability, and flexibility in handling varying workloads and context sizes. Modern memory management techniques are employed to mitigate the potential downsides of dynamic allocation, making it a practical and efficient choice for such systems.
@rick-github commented on GitHub (Mar 5, 2025):
LLMs do not understand complex systems.
Gemini is talking about slab allocators, scalability and resource efficiency. This is all fine and good in a general computing environment like your desktop computer. This is not the same environment as a GPU. A GPU has a single purpose - perform matrix calculations on a bunch of numbers. There's nothing to be gained from allocating gigabytes of memory and then freeing it 20 seconds later. It's just adding overhead to the completion.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
I see.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
About the trick we did with reducing the max parallel number in the ollama systemd file: that reduced the size of the model from 11 GB to 5 GB. Most graphics cards out there don't come with 16 GB of VRAM, so it won't just be me who is struggling, and we are still talking about a 2-billion-parameter model.
Also, why do you say it would be freed after 20 seconds? You are looking at the wrong side of the problem: it is not needed for 90% of the time the user is using ollama.
So you will find a lot of people setting max parallel to 1, and whenever they try to run anything in parallel it will be queued, which is fine.
Most people don't even know how to set the max parallel number to 1; they will just leave it at 4 and then find that the system can't run a 2.5 GB model, because they have 8 GB of VRAM and need at least 11 GB to run the 2-billion-parameter model.
What I am saying is: why reserve the memory if it's not needed? I set it to 1 and the model ran just fine with 5 GB. So why would anyone reserve 11 GB for that model? For what, the possibility of another request, when we know that ollama is used locally by one user? One request is all we need, and if another request comes in, the handler should consider increasing the max parallel number if there is enough VRAM; if not, it should be queued.
It could also be made so that the max parallel number doesn't return to 1 until a few minutes have passed, instead of 20 seconds. That way the overhead problem won't be as prominent if the user is sending multiple requests at a regular interval.
@rick-github commented on GitHub (Mar 5, 2025):
ollama only uses 4 as the default for OLLAMA_NUM_PARALLEL if there are enough resources to do so. If it thinks there are not enough, it falls back to 1. So it seems there is a problem with the logic there, maybe because this is a vision model, which has an extra set of weights. I'll have a look.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
Could it be that I have swap memory enabled and it counts it as regular RAM?
memory.available="[5.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.4 GiB" memory.required.partial="0 B" memory.required.kv="1.5 GiB" memory.required.allocations="[5.4 GiB]" memory.weights.total="5.6 GiB" memory.weights.repeating="5.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB"
Mar 04 22:01:21 box ollama[2541]: time=2025-03-04T22:01:21.247-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-3953920e5adb604f45914bd4c30e2f1df1fde7456a8e0471b2086577da1d46fd --ctx-size 8192 --batch-size 512 --threads 2 --no-mmap --parallel 4 --port 35331"
I used it with a normal model (nous-hermes2:10.7b-solar-q2_K). it didn't reduce the parallel number
@rick-github commented on GitHub (Mar 5, 2025):
No, what I think is happening is the fallback to 1 only kicks in if all of the model will fit in the available VRAM. That is, say the model takes 11G at parallel=4 and 5G at parallel=1 and you have 4.9G free on the GPU, ollama gives up because it can't fit everything on the GPU and goes with parallel=4.
When you have the model loaded with parallel=1, what's the output of ollama ps?
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
Hang on, I just came back; the cats in the garden were fighting, and I had to eat something before fasting tomorrow, it's the second day of Ramadan.
I changed it and will restart the laptop now; it takes 6.9 GB.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
nous-hermes2:10.7b-solar-q2_K 2931d5c846b2 5.2 GB 100% CPU
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
memory.available="[5.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.8 GiB" memory.required.partial="0 B" memory.required.kv="384.0 MiB" memory.required.allocations="[4.8 GiB]" memory.weights.total="4.5 GiB" memory.weights.repeating="4.4 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
Mar 04 22:32:53 box ollama[2497]: time=2025-03-04T22:32:53.761-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-3953920e5adb604f45914bd4c30e2f1df1fde7456a8e0471b2086577da1d46fd --ctx-size 2048 --batch-size 512 --threads 2 --no-mmap --parallel 1 --port 42335"
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
memory.available="[5.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.4 GiB" memory.required.partial="0 B" memory.required.kv="1.5 GiB" memory.required.allocations="[5.5 GiB]" memory.weights.total="5.6 GiB" memory.weights.repeating="5.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB"
Mar 04 22:27:39 box ollama[2541]: time=2025-03-04T22:27:39.313-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-3953920e5adb604f45914bd4c30e2f1df1fde7456a8e0471b2086577da1d46fd --ctx-size 8192 --batch-size 512 --threads 2 --no-mmap --parallel 4 --port 43149"
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
I am running everything on CPU. The GPU in the laptop is an old Nvidia with 4 GB of VRAM, but ollama doesn't use it because it has a compute capability of 3.4, I think, or something like that. It's old.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
So what you're saying is that if the model will fit in memory with parallel=1, it will do that automatically, but if it's going to end up in swap memory anyway, it goes with parallel=4.
I will try to find a model that is big enough that it doesn't fit with parallel=4, but small enough that it fits with parallel=1.
tricky!
@rick-github commented on GitHub (Mar 5, 2025):
That's my guess.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
memory.available="[4.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.4 GiB" memory.required.partial="0 B" memory.required.kv="1.2 GiB" memory.required.allocations="[4.4 GiB]" memory.weights.total="2.6 GiB" memory.weights.repeating="2.5 GiB" memory.weights.nonrepeating="78.8 MiB" memory.graph.full="853.3 MiB" memory.graph.partial="853.3 MiB" projector.weights="851.2 MiB" projector.graph="0 B"
Mar 03 07:58:33 box ollama[2578]: time=2025-03-03T07:58:33.194-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-1aefcd9a8a15091b670951963b5f8a7e6653bb1350345e9621e179685ac9bc5f --ctx-size 16384 --batch-size 512 --mmproj /usr/share/ollama/.ollama/models/blobs/sha256-4d464be24899cf8dc1862945432e0cef4366c4181fa38b14754cc9279b727608 --threads 2 --no-mmap --parallel 1 --port 33301"
I was looking in the server log, trying to go back in time to before I set max parallel to 1, to see the pattern of it setting max parallel to 1 on its own, and I think this is an example that confirms what you're saying.
@rick-github commented on GitHub (Mar 5, 2025):
Yeah, it calculated that at parallel=1 it could use 4.4G of the 4.9G available, so it went with that instead of parallel=4 which would have resulted in spilling the model to system RAM.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
No man, I don't think it's that. I was trying to replicate the same principle to get it to automatically revert to parallel=1 and I couldn't do it. However, I looked again in the server log and found that the date of the first parallel=1 entry was the 3rd of March; that was basically when we started talking.
I am not sure whether that is enough to say there is an issue with ollama, or with my system and install, or whether I simply wasn't in the right situation to trigger the automatic change to parallel=1.
What do you think?
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
I mean, have you tested this feature (reverting to parallel=1 when needed) yourself? Do you know for a fact that it is functioning the way it should?
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
It's usually not wise to doubt source code that actively serves thousands of users, but mistakes and bugs can happen from one release to the next.
Anyway, ollama just released version 0.5.13, and people are going to use the granite-vision model; they should hit the same issue I had if the problem is in the code and not in my device and system.
So time will tell.
@rick-github commented on GitHub (Mar 5, 2025):
It appears to function as surmised.
@rick-github commented on GitHub (Mar 5, 2025):
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
Could it be that this only works for GPU?
Try it again while running it on CPU.
@rick-github commented on GitHub (Mar 5, 2025):
Why? The change to the default depends on free VRAM.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
Dude, lol. I am running on CPU. The laptop I have has an Nvidia card, but its compute capability is old and ollama doesn't use it.
That being said, ollama does say that there is an Nvidia card when I install it, so it could be that ollama is confused between the card and the CPU RAM.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
I will try to do the same experiment you did, increasing the ctx-size and monitoring the parallel number.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
So I tried to get it to work, but no. I am not using Docker; what I do is set the model's context size like you do, with "/set parameter num_ctx 16384", but the server log doesn't record any change. I repeated it multiple times, yet the server log has only one entry of the model being loaded, with the initial value of 8192.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
Do you think using Docker might help with this? I don't have Docker installed. Can you check your server log to see if there is any change when you change it with the previous method? Maybe that's normal and the server log simply doesn't reflect the change even though it's happening.
@rick-github commented on GitHub (Mar 5, 2025):
The parallel adjustment is only done for VRAM. Since you are using system RAM, ollama will use the default of 4 for OLLAMA_NUM_PARALLEL.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
I knew I wasn't crazy, lol.
@Abdulrahman392011 commented on GitHub (Mar 5, 2025):
At least very few people will have this issue with the model. Most people use a GPU; no one will use a CPU and wait around for 15 minutes for an image description.