Closed · 32 comments
Originally created by @blueApple12 on GitHub (Jan 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8310
What is the issue?
I bought a new PC with a 4070 Super to do some AI tasks using Ollama, but when I try to run llama3.2-vision it doesn't utilize my GPU, only my CPU. llama3.2 does utilize my GPU, so why is that? Thank you.
OS
Windows
GPU
Nvidia
CPU
AMD
Ollama version
0.5.4
@rick-github commented on GitHub (Jan 5, 2025):
Maybe not enough free VRAM on your system, depending on what else you are running. The output of nvidia-smi and the server logs will aid in identifying the cause.
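(As a concrete way to run that check: Ollama's /api/ps endpoint, which `ollama ps` wraps, reports how much of each loaded model is resident in VRAM. A minimal sketch, assuming the default host and port; the helper name and output format are illustrative, while the endpoint and its size/size_vram fields are from the Ollama API docs.)

# Query a local Ollama server for its loaded models and how much of each
# sits in VRAM. GET /api/ps and its size/size_vram fields are documented
# in the Ollama API; a fully offloaded model reports close to 100% in VRAM.
import json
from urllib.request import urlopen

def report_loaded_models(host: str = "http://127.0.0.1:11434") -> None:
    with urlopen(f"{host}/api/ps") as resp:
        data = json.load(resp)
    for m in data.get("models", []):
        total = m.get("size", 0)
        vram = m.get("size_vram", 0)
        pct = 100 * vram / total if total else 0
        print(f"{m['name']}: {vram / 2**30:.2f} GiB of "
              f"{total / 2**30:.2f} GiB in VRAM ({pct:.0f}%)")

if __name__ == "__main__":
    report_loaded_models()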
@blueApple12 commented on GitHub (Jan 5, 2025):
This is my nvidia-smi output:
Sun Jan 5 18:15:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.36 Driver Version: 566.36 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... WDDM | 00000000:01:00.0 On | N/A |
| 0% 42C P0 33W / 220W | 1350MiB / 12282MiB | 14% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
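(Free VRAM can also be polled programmatically rather than read off the table above, which helps when checking memory right as a model loads. A small sketch using nvidia-smi's documented --query-gpu CSV interface; error handling is omitted.)

# Poll free/total GPU memory via nvidia-smi's machine-readable query mode.
# --query-gpu=memory.free,memory.total with --format=csv,noheader,nounits
# prints one "free, total" pair per GPU, in MiB.
import subprocess

def gpu_memory_mib() -> list[tuple[int, int]]:
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.free,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [tuple(int(x) for x in line.split(","))
            for line in out.strip().splitlines()]

if __name__ == "__main__":
    for i, (free, total) in enumerate(gpu_memory_mib()):
        print(f"GPU {i}: {free} MiB free of {total} MiB")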
@rick-github commented on GitHub (Jan 5, 2025):
server logs will aid in identifying the cause.
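(On Windows, those server logs live under %LOCALAPPDATA%\Ollama by default, per the Ollama troubleshooting docs. A trivial sketch that prints the expected paths; the file names are the documented defaults.)

# Print the default Ollama log file paths on Windows. The
# %LOCALAPPDATA%\Ollama location and the server.log/app.log names follow
# the Ollama troubleshooting docs; adjust if your install differs.
import os

base = os.path.expandvars(r"%LOCALAPPDATA%\Ollama")
for name in ("server.log", "app.log"):
    print(os.path.join(base, name))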
@blueApple12 commented on GitHub (Jan 6, 2025):
This is my server log:
2025/01/05 16:43:45 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\avish\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-01-05T16:43:45.920+02:00 level=INFO source=images.go:757 msg="total blobs: 12"
time=2025-01-05T16:43:45.926+02:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2025-01-05T16:43:45.929+02:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2025-01-05T16:43:45.930+02:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx cpu]"
time=2025-01-05T16:43:45.931+02:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-01-05T16:43:45.932+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-01-05T16:43:45.932+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-01-05T16:43:46.096+02:00 level=INFO source=amd_hip_windows.go:103 msg="AMD ROCm reports no devices found"
time=2025-01-05T16:43:46.096+02:00 level=INFO source=amd_windows.go:50 msg="no compatible amdgpu devices detected"
time=2025-01-05T16:43:46.099+02:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
[GIN] 2025/01/05 - 16:43:59 | 200 | 500µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:43:59 | 200 | 2.5007ms | 127.0.0.1 | GET "/api/tags"
time=2025-01-05T16:44:53.262+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T16:44:53.331+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.4 GiB" free_swap="10.1 GiB"
time=2025-01-05T16:44:53.336+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[3.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T16:44:53.346+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59288"
time=2025-01-05T16:44:53.352+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:44:53.353+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:44:53.353+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:44:53.387+02:00 level=INFO source=runner.go:945 msg="starting go runner"
time=2025-01-05T16:44:53.404+02:00 level=INFO source=runner.go:946 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:44:53.406+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59288"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
time=2025-01-05T16:44:53.607+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: CPU model buffer size = 5679.33 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 656.25 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T16:45:03.649+02:00 level=INFO source=server.go:594 msg="llama runner started in 10.30 seconds"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:45:09 | 200 | 16.5585361s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:45:09.804+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:45:19 | 200 | 9.9060809s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:45:36.380+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:45:41 | 200 | 5.0508025s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:45:41.512+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:45:45 | 500 | 4.0971502s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:47:36.667+02:00 level=INFO source=runner.go:662 msg="aborting completion request due to client closing the connection"
time=2025-01-05T16:47:38.948+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:47:43 | 200 | 4.9453625s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:47:43.887+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:47:52 | 200 | 8.9424866s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:08.430+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:48:14 | 200 | 5.8496372s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:14.287+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:48:27 | 200 | 13.5327677s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:45.398+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:48:51 | 200 | 5.8480718s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:48:51.241+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:49:08 | 200 | 17.0670151s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:49:41.721+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:49:47 | 200 | 5.3733708s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:49:47.151+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:50:14 | 500 | 27.8632648s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:50:22 | 200 | 997.4µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:50:22 | 200 | 26.5029ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T16:50:22.893+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:50:22 | 200 | 24.5118ms | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T16:50:24.739+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:50:40 | 200 | 15.6020101s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:51:35 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:51:35 | 200 | 62.5791ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T16:51:35.660+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="10.5 GiB"
time=2025-01-05T16:51:35.661+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=4 available=11251539968 required="3.7 GiB"
time=2025-01-05T16:51:35.683+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.8 GiB" free_swap="19.7 GiB"
time=2025-01-05T16:51:35.683+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T16:51:35.688+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 4 --port 59523"
time=2025-01-05T16:51:35.695+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=2
time=2025-01-05T16:51:35.695+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:51:35.695+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:51:36.419+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:51:36.460+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:51:36.463+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59523"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
time=2025-01-05T16:51:36.702+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 4
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 2.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 424.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T16:51:38.959+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.26 seconds"
[GIN] 2025/01/05 - 16:51:38 | 200 | 3.364194s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/01/05 - 16:51:51 | 200 | 676.757ms | 127.0.0.1 | POST "/api/chat"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:51:57 | 200 | 1.1356145s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:52:11 | 200 | 5.1867467s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:52:20 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:52:20 | 200 | 15.4987ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T16:52:20.927+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T16:52:25.947+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0186925 model=C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-01-05T16:52:26.009+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="7.1 GiB"
time=2025-01-05T16:52:26.196+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2687383 model=C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-01-05T16:52:26.356+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.8 GiB" free_swap="19.7 GiB"
time=2025-01-05T16:52:26.358+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=33 layers.split="" memory.available="[10.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.3 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.3 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T16:52:26.363+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 33 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59564"
time=2025-01-05T16:52:26.368+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:52:26.368+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:52:26.368+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:52:26.470+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:52:26.509+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:52:26.510+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59564"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
time=2025-01-05T16:52:26.620+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 33 repeating layers to GPU
llm_load_tensors: offloaded 33/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1306.52 MiB
llm_load_tensors: CUDA0 model buffer size = 4090.98 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 48.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 558.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 71 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T16:52:35.902+02:00 level=INFO source=server.go:594 msg="llama runner started in 9.53 seconds"
[GIN] 2025/01/05 - 16:52:35 | 200 | 14.9908729s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T16:52:44.743+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:52:54 | 200 | 10.2073533s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:53:23.080+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:53:35 | 200 | 12.800509s | 127.0.0.1 | POST "/api/chat"
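
The `vocab_only = 1` dump that precedes this request is not a second full load: `vocab only - skipping tensors` means the server opened a tokenizer-only copy of the GGUF (as far as I can tell, to count and truncate prompt tokens). That accounting surfaces in the final API response; a sketch with a placeholder tag:

```python
import requests

# The non-streaming /api/chat response reports the token accounting the
# tokenizer-only load makes possible. Placeholder model tag.
r = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "llama3.2-vision",
        "messages": [{"role": "user", "content": "hi"}],
        "stream": False,
    },
).json()
print(r["prompt_eval_count"], r["eval_count"])   # tokens in / tokens out
```
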
time=2025-01-05T16:55:02.546+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="644.1 MiB"
time=2025-01-05T16:55:02.868+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=4 available=11424337920 required="3.7 GiB"
time=2025-01-05T16:55:02.891+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.2 GiB" free_swap="19.5 GiB"
time=2025-01-05T16:55:02.891+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T16:55:02.896+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 4 --port 59620"
time=2025-01-05T16:55:02.900+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:55:02.900+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:55:02.901+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:55:02.994+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:55:03.031+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:55:03.032+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59620"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
time=2025-01-05T16:55:03.152+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 4
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 2.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 424.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T16:55:03.904+02:00 level=INFO source=server.go:594 msg="llama runner started in 1.00 seconds"
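
For what it's worth, the numbers in this load are internally consistent; a quick check with values copied straight from the log:

```python
# All inputs copied from the llama 3.2 3B load above.
n_layer       = 28         # llama.block_count
n_head_kv     = 8
n_embd_head_k = 128
n_ctx         = 8192       # 4 parallel slots x 2048 tokens each
BYTES_F16     = 2

n_embd_k_gqa = n_head_kv * n_embd_head_k              # 1024, as printed
k_bytes = n_ctx * n_embd_k_gqa * BYTES_F16 * n_layer
print(k_bytes / 2**20)                                # 448.0 -> "K (f16): 448.00 MiB"

# Bits per weight from "model size = 1.87 GiB (5.01 BPW)" and 3.21 B params:
print(1.87 * 2**30 * 8 / 3.21e9)                      # ~5.0
```
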
[GIN] 2025/01/05 - 16:55:07 | 200 | 4.7508347s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:55:38 | 200 | 15.5506907s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:56:03 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:56:03 | 404 | 497.8µs | 127.0.0.1 | POST "/api/show"
[GIN] 2025/01/05 - 16:56:04 | 200 | 1.0677276s | 127.0.0.1 | POST "/api/pull"
[GIN] 2025/01/05 - 16:56:12 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:56:12 | 200 | 16.0004ms | 127.0.0.1 | POST "/api/show"
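
The 404 from `/api/show` followed by `/api/pull` above is the usual ensure-model-exists pattern; a sketch of the same flow (default port, placeholder tag):

```python
import requests

BASE = "http://127.0.0.1:11434"
model = "llama3.2-vision"   # placeholder tag

# /api/show returns 404 for a model that isn't local, so pull it and retry,
# mirroring the request sequence in the log above.
if requests.post(f"{BASE}/api/show", json={"model": model}).status_code == 404:
    requests.post(f"{BASE}/api/pull", json={"model": model, "stream": False})
print(requests.post(f"{BASE}/api/show", json={"model": model}).status_code)
```
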
time=2025-01-05T16:56:12.092+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T16:56:12.138+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="7.2 GiB"
time=2025-01-05T16:56:12.485+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.4 GiB" free_swap="19.7 GiB"
time=2025-01-05T16:56:12.488+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=34 layers.split="" memory.available="[10.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.5 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.5 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
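
Rough reconstruction of how that line becomes `layers.offload=34` (back-of-envelope arithmetic over the printed values only, not ollama's actual memory.go logic):

```python
# Back-of-envelope only -- NOT ollama's real estimator, just the printed numbers.
available = 10.6                                    # GiB free on the GPU
projector = 1.8 + 2.8                               # projector.weights + projector.graph
fixed     = (411.0 + 656.2 + 669.5) / 1024          # nonrepeating weights + KV + partial graph, GiB
per_layer = 5.1 / 40                                # repeating weights spread over 40 layers
print((available - projector - fixed) / per_layer)  # ~33.8, close to layers.offload=34
```
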
time=2025-01-05T16:56:12.492+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 34 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59649"
time=2025-01-05T16:56:12.497+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T16:56:12.497+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T16:56:12.497+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T16:56:12.583+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T16:56:12.618+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T16:56:12.619+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59649"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-01-05T16:56:12.748+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 34 repeating layers to GPU
llm_load_tensors: offloaded 34/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1189.49 MiB
llm_load_tensors: CUDA0 model buffer size = 4208.01 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 40.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 566.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 60 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T16:56:15.507+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.01 seconds"
[GIN] 2025/01/05 - 16:56:15 | 200 | 3.4331242s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T16:56:16.504+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 16:56:17 | 200 | 1.1631321s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T16:56:25.494+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 16:56:32 | 200 | 7.4890539s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 16:57:25 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 16:57:25 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
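
`GET /api/ps` is the quickest way to see how much of each resident model actually sits in VRAM versus host memory, which is the interesting number whenever `offloaded N/41 layers` shows a partial split:

```python
import requests

# List resident models with their total size and the VRAM-resident portion.
for m in requests.get("http://127.0.0.1:11434/api/ps").json()["models"]:
    print(m["name"], m["size"], m.get("size_vram"))
```
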
[GIN] 2025/01/05 - 17:17:12 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:17:12 | 200 | 16.5005ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:17:12.973+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T17:17:13.029+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.5 GiB" free_swap="19.4 GiB"
time=2025-01-05T17:17:13.033+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=32 layers.split="" memory.available="[10.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.2 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.2 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T17:17:13.038+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 32 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 59968"
time=2025-01-05T17:17:13.043+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:17:13.043+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:17:13.043+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:17:13.146+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:17:13.182+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:17:13.183+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:59968"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
time=2025-01-05T17:17:13.295+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1439.02 MiB
llm_load_tensors: CUDA0 model buffer size = 3958.48 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 56.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 550.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 82 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T17:17:16.304+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.26 seconds"
[GIN] 2025/01/05 - 17:17:16 | 200 | 3.3481249s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T17:17:18.345+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:17:19 | 200 | 690.9205ms | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T17:17:31.745+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 17:17:37 | 200 | 5.7424355s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:17:49 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:17:49 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
time=2025-01-05T17:32:26.860+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=1 available=3969449984 required="2.9 GiB"
time=2025-01-05T17:32:26.881+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.6 GiB" free_swap="9.6 GiB"
time=2025-01-05T17:32:26.882+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[3.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.9 GiB" memory.required.partial="2.9 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[2.9 GiB]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="256.5 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T17:32:26.887+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 1 --port 60186"
time=2025-01-05T17:32:26.892+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:32:26.892+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:32:26.892+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:32:27.004+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:32:27.040+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:32:27.041+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:60186"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
time=2025-01-05T17:32:27.144+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 224.00 MiB
llama_new_context_with_model: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 256.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 10.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T17:32:28.148+02:00 level=INFO source=server.go:594 msg="llama runner started in 1.26 seconds"
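
Compared with the earlier 3B load, the scheduler now chose `parallel=1` and `--ctx-size 2048`, which is why the KV cache shrank from 896 MiB to 224 MiB. If you'd rather pin this behavior than let it float with free VRAM, the documented environment variables can be set before `ollama serve`; a sketch:

```python
import os
import subprocess

# Sketch: pin scheduler behavior with documented environment variables.
env = dict(os.environ)
env["OLLAMA_NUM_PARALLEL"] = "1"        # one slot per model -> smaller KV cache
env["OLLAMA_MAX_LOADED_MODELS"] = "1"   # evict the old model instead of sharing VRAM
subprocess.run(["ollama", "serve"], env=env)
```
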
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 17:32:28 | 200 | 1.9139322s | 127.0.0.1 | POST "/api/chat"
time=2025-01-05T17:32:28.741+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T17:32:28.779+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="1.1 GiB"
time=2025-01-05T17:32:29.126+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.6 GiB" free_swap="9.6 GiB"
time=2025-01-05T17:32:29.130+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[3.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T17:32:29.131+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 60190"
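
Note the runner path above: with the 3B model still resident, only ~1.1 GiB of VRAM was left, `layers.offload=0`, and the vision model fell back to the `cpu_avx2` runner entirely. One way to avoid that thrash when alternating models is to unload the old one explicitly with the documented `keep_alive` parameter before requesting the next (placeholder tags):

```python
import requests

BASE = "http://127.0.0.1:11434"

# An empty-prompt request with keep_alive=0 unloads the resident model,
# freeing its VRAM before the next (larger) model is scheduled.
requests.post(f"{BASE}/api/generate",
              json={"model": "llama3.2", "keep_alive": 0})
requests.post(f"{BASE}/api/generate",
              json={"model": "llama3.2-vision", "prompt": "hi", "stream": False})
```
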
time=2025-01-05T17:32:29.138+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:32:29.138+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:32:29.138+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:32:29.154+02:00 level=INFO source=runner.go:945 msg="starting go runner"
time=2025-01-05T17:32:29.171+02:00 level=INFO source=runner.go:946 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:32:29.172+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:60190"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
time=2025-01-05T17:32:29.389+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: CPU model buffer size = 5679.33 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 656.25 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T17:32:38.426+02:00 level=INFO source=server.go:594 msg="llama runner started in 9.29 seconds"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/05 - 17:32:50 | 200 | 22.0606032s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:44:48 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:44:48 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:44:58 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:44:58 | 200 | 16.9993ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:44:58.564+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 parallel=4 available=11475161088 required="3.7 GiB"
time=2025-01-05T17:44:58.585+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.3 GiB" free_swap="17.9 GiB"
time=2025-01-05T17:44:58.585+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2025-01-05T17:44:58.590+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 6 --no-mmap --parallel 4 --port 63944"
time=2025-01-05T17:44:58.599+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:44:58.599+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:44:58.599+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:44:58.726+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:44:58.763+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:44:58.764+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63944"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
time=2025-01-05T17:44:58.850+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from C:\Users\avish\.ollama\models\blobs\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.21 B
llm_load_print_meta: model size = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU model buffer size = 308.23 MiB
llm_load_tensors: CUDA0 model buffer size = 1918.35 MiB
llama_new_context_with_model: n_seq_max = 4
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 2.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 424.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T17:45:00.104+02:00 level=INFO source=server.go:594 msg="llama runner started in 1.51 seconds"
[GIN] 2025/01/05 - 17:45:00 | 200 | 1.5937332s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/01/05 - 17:45:04 | 200 | 648.2546ms | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:45:11 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:11 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:45:20 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:20 | 200 | 16.4985ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:45:20.728+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-05T17:45:20.764+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-9c583a21-4c07-29c6-80d1-fb420a8f07f5 library=cuda total="12.0 GiB" available="7.3 GiB"
time=2025-01-05T17:45:21.110+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="18.4 GiB" free_swap="18.0 GiB"
time=2025-01-05T17:45:21.112+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=35 layers.split="" memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.6 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.6 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-05T17:45:21.118+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 35 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 63956"
time=2025-01-05T17:45:21.123+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T17:45:21.123+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-05T17:45:21.123+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T17:45:21.205+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-05T17:45:21.240+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-05T17:45:21.241+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:63956"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-01-05T17:45:21.375+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 35 repeating layers to GPU
llm_load_tensors: offloaded 35/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 1072.46 MiB
llm_load_tensors: CUDA0 model buffer size = 4325.04 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 32.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 574.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 49 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-05T17:45:24.635+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.51 seconds"
[GIN] 2025/01/05 - 17:45:24 | 200 | 3.922538s | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T17:45:27.666+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:45:28 | 200 | 1.3313811s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:45:31 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:31 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:45:43 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:45:43 | 200 | 20.9989ms | 127.0.0.1 | POST "/api/show"
time=2025-01-05T17:45:43.977+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:45:43 | 200 | 16.001ms | 127.0.0.1 | POST "/api/generate"
time=2025-01-05T17:45:56.150+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/05 - 17:46:33 | 200 | 37.3366449s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/01/05 - 17:46:36 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/05 - 17:46:36 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
[GIN] 2025/01/05 - 17:49:35 | 200 | 500.9µs | 127.0.0.1 | GET "/api/version"
[GIN] 2025/01/06 - 17:54:18 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/06 - 17:54:18 | 200 | 16.4978ms | 127.0.0.1 | POST "/api/show"
time=2025-01-06T17:54:18.334+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-06T17:54:18.418+02:00 level=INFO source=server.go:104 msg="system memory" total="31.1 GiB" free="17.8 GiB" free_swap="18.0 GiB"
time=2025-01-06T17:54:18.421+02:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=36 layers.split="" memory.available="[10.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.8 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.8 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-06T17:54:18.426+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\avish\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12_avx\ollama_llama_server.exe runner --model C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 36 --mmproj C:\Users\avish\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 57521"
time=2025-01-06T17:54:18.436+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-06T17:54:18.436+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-06T17:54:18.437+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-06T17:54:18.577+02:00 level=INFO source=runner.go:945 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
time=2025-01-06T17:54:18.622+02:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-01-06T17:54:18.623+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:57521"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
time=2025-01-06T17:54:18.688+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: offloading 36 repeating layers to GPU
llm_load_tensors: offloaded 36/41 layers to GPU
llm_load_tensors: CPU model buffer size = 281.83 MiB
llm_load_tensors: CUDA_Host model buffer size = 939.96 MiB
llm_load_tensors: CUDA0 model buffer size = 4457.54 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 24.00 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 50.03 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 582.22 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 38 (with bs=512), 3 (with bs=1)
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-06T17:54:22.199+02:00 level=INFO source=server.go:594 msg="llama runner started in 3.76 seconds"
[GIN] 2025/01/06 - 17:54:22 | 200 | 3.8828603s | 127.0.0.1 | POST "/api/generate"
time=2025-01-06T17:54:26.611+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/01/06 - 17:54:27 | 200 | 1.042562s | 127.0.0.1 | POST "/api/chat"
time=2025-01-06T17:54:36.996+02:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\avish\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llama_model_load: vocab only - skipping tensors
[GIN] 2025/01/06 - 17:55:04 | 200 | 27.6056929s | 127.0.0.1 | POST "/api/chat"
@ChandlerHooley commented on GitHub (Jan 7, 2025):
Having the same issue as well. Latest version of Ollama and an NVIDIA GTX 1650 SUPER graphics card. (Yes, I know it isn't powerful; this is just for a POC.) Here are my logs from running `ollama serve` and then, in another window, `ollama run llama3.2-vision`. If I can provide any other information that would help, please let me know.
2025/01/06 22:07:11 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\chand\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-01-06T22:07:11.022-06:00 level=INFO source=images.go:757 msg="total blobs: 11"
time=2025-01-06T22:07:11.023-06:00 level=INFO source=images.go:764 msg="total unused blobs removed: 0"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=routes.go:1310 msg="Listening on 127.0.0.1:11434 (version 0.5.4)"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=routes.go:1339 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx]"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-01-06T22:07:11.024-06:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-01-06T22:07:11.024-06:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-01-06T22:07:11.158-06:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-dd59afa0-5797-0eb5-41fa-a4e67e77623f library=cuda variant=v12 compute=7.5 driver=12.6 name="NVIDIA GeForce GTX 1650 SUPER" total="4.0 GiB" available="3.2 GiB"
[GIN] 2025/01/06 - 22:07:22 | 200 | 544.5µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/01/06 - 22:07:22 | 200 | 54.7107ms | 127.0.0.1 | POST "/api/show"
time=2025-01-06T22:07:22.747-06:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2025-01-06T22:07:22.793-06:00 level=INFO source=server.go:104 msg="system memory" total="63.7 GiB" free="42.7 GiB" free_swap="45.0 GiB"
time=2025-01-06T22:07:22.796-06:00 level=INFO source=memory.go:356 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[2.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[0 B]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-06T22:07:22.802-06:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\Users\chand\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe runner --model C:\Users\chand\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --mmproj C:\Users\chand\.ollama\models\blobs\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 58622"
time=2025-01-06T22:07:22.960-06:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-06T22:07:22.960-06:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-01-06T22:07:22.962-06:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-01-06T22:07:22.967-06:00 level=INFO source=runner.go:945 msg="starting go runner"
time=2025-01-06T22:07:22.969-06:00 level=INFO source=runner.go:946 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2025-01-06T22:07:22.970-06:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:58622"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\chand\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q6_K: 37 tensors
time=2025-01-06T22:07:23.213-06:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: CPU model buffer size = 5679.33 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 656.25 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-01-06T22:07:37.235-06:00 level=INFO source=server.go:594 msg="llama runner started in 14.27 seconds"
[GIN] 2025/01/06 - 22:07:37 | 200 | 14.5041464s | 127.0.0.1 | POST "/api/generate"
@rick-github commented on GitHub (Jan 7, 2025):
When ollama started, there was 10.8G of free VRAM. When it came time to load a model, something else was running and only 3.5G was free. The llama3.2-vision model won't fit in that, so ollama loads it into RAM.
The model is unloaded after 5 minutes (the keep-alive default), and a bit later another request comes in for it. This time there is 10.5G available and ollama does a partial load (33 of 41 layers) into the GPU.
Your GPU is too small to host the entire model, and other GPU users are occasionally taking VRAM to the point where ollama can't even do a partial load.
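As a quick check, `ollama ps` (standard CLI, nothing assumed beyond a running server) shows whether the loaded model ended up on the CPU, the GPU, or split between the two:
```
# PROCESSOR shows "100% GPU" for a full offload, "100% CPU" for a RAM-only
# load, or a split such as "25%/75% CPU/GPU" for a partial load.
ollama ps
```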
@blueApple12 commented on GitHub (Jan 7, 2025):
So I don't have enough VRAM?
@rick-github commented on GitHub (Jan 7, 2025):
Correct.
@blueApple12 commented on GitHub (Jan 7, 2025):
Is there a way to get around this?
@rick-github commented on GitHub (Jan 7, 2025):
Stop other applications from using the GPU.
https://www.google.com/search?q=windows+switch+default+gpu+to+integrated
https://www.google.com/search?q=windows+restrict+process+from+using+gpu
@blueApple12 commented on GitHub (Jan 7, 2025):
Is there any other way to use less VRAM, like a low-VRAM mode?
@rick-github commented on GitHub (Jan 7, 2025):
There are two components that take up VRAM: context and weights. The usual ways of reducing context size (`num_ctx`, `OLLAMA_NUM_PARALLEL`, `OLLAMA_FLASH_ATTENTION`) won't help because you are already using the minimum context. Other models (e.g. llama3.2:3b) come in a variety of quantizations, which can be used to reduce the size of the weights. The default quant for llama3.2:3b is q4_K_M, which is 2G, but the size can be as low as 1.4G with the q2_K quant. Unfortunately, llama3.2-vision doesn't offer anything smaller than q4_K_M at 7.9G. I haven't tried this, but in theory you could take the base model and quantize it yourself to something smaller. However, I don't think the tool I usually use for quantizing models (llama.cpp) supports the llama3.2-vision architecture (mllama), so you'd need to find suitable tools.
One last alternative would be to force llama.cpp to load all layers into VRAM and let the GPU overflow to RAM, rather than having ollama decide on the split. This maximizes VRAM usage at the cost of a performance penalty for the layers left in RAM. However, because you can almost fit the model in VRAM, only a few layers will spill over, and the penalty might not be noticeable. You can force this by setting `num_gpu` to the number of layers (or really any number greater than or equal to the layer count). See here for ways to adjust `num_gpu`; a minimal sketch follows below.
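For illustration, a minimal sketch of the `num_gpu` override, assuming the default local server and the 41-layer model from the logs above (the prompt and the exact layer count are placeholders, not prescriptive):
```
# One-off override via the REST API: request all 41 layers on the GPU.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2-vision",
  "prompt": "why is the sky blue?",
  "options": { "num_gpu": 41 }
}'

# Or interactively, inside `ollama run llama3.2-vision`:
#   /set parameter num_gpu 41
```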
@rick-github commented on GitHub (Jan 7, 2025):
@ChandlerHooley
Your GPU has 3.2G free. Just the projector (2.8G) and context space (656M) add up to more than this, so there is no way to run llama3.2-vision on your GPU, even with the `num_gpu` hack from above.
@blueApple12 commented on GitHub (Jan 7, 2025):
Why is my GPU so full? I just built this PC a week ago. Will the full output of nvidia-smi help identify what takes all of the VRAM?
@rick-github commented on GitHub (Jan 7, 2025):
I'm not a Windows user so fine details of process usage escape me. Try this for help: https://saturncloud.io/blog/how-to-find-and-limit-gpu-usage-by-process-in-windows/#finding-gpu-usage-by-process
@blueApple12 commented on GitHub (Jan 7, 2025):
I really didn't understand this page. If someone can understand this and help me, it would be great.
Tue Jan 7 16:34:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.36 Driver Version: 566.36 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... WDDM | 00000000:01:00.0 On | N/A |
| 30% 35C P5 15W / 220W | 895MiB / 12282MiB | 28% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1244 C+G ...2txyewy\StartMenuExperienceHost.exe N/A |
| 0 N/A N/A 2748 C+G ...air\Corsair iCUE5 Software\iCUE.exe N/A |
| 0 N/A N/A 2836 C+G ...nt.CBS_cw5n1h2txyewy\SearchHost.exe N/A |
| 0 N/A N/A 5704 C+G ...al\Discord\app-1.0.9175\Discord.exe N/A |
| 0 N/A N/A 6336 C+G ...\Cef\CefSharp.BrowserSubprocess.exe N/A |
| 0 N/A N/A 11832 C+G ....0_x64__8wekyb3d8bbwe\XboxPcApp.exe N/A |
| 0 N/A N/A 12004 C+G ...crosoft\Edge\Application\msedge.exe N/A |
| 0 N/A N/A 14244 C+G ...6.0_x64__cv1g1gvanyjgm\WhatsApp.exe N/A |
| 0 N/A N/A 15152 C+G ...oogle\Chrome\Application\chrome.exe N/A |
| 0 N/A N/A 15476 C+G ...n\131.0.2903.112\msedgewebview2.exe N/A |
| 0 N/A N/A 21852 C+G ...n\NVIDIA app\CEF\NVIDIA Overlay.exe N/A |
| 0 N/A N/A 23444 C+G ...ces\Razer Central\Razer Central.exe N/A |
| 0 N/A N/A 23724 C+G ...n\131.0.2903.112\msedgewebview2.exe N/A |
| 0 N/A N/A 23812 C+G ...n\NVIDIA app\CEF\NVIDIA Overlay.exe N/A |
| 0 N/A N/A 24456 C+G ...siveControlPanel\SystemSettings.exe N/A |
| 0 N/A N/A 24876 C+G ...\cef\cef.win7x64\steamwebhelper.exe N/A |
| 0 N/A N/A 27828 C+G ...x64__97hta09mmv6hy\Build\Lively.exe N/A |
| 0 N/A N/A 30528 C+G ... Synapse 3 Host\Razer Synapse 3.exe N/A |
| 0 N/A N/A 32416 C+G ...nr4m\radeonsoftware\AMDRSSrcExt.exe N/A |
| 0 N/A N/A 34800 C+G ...m\radeonsoftware\RadeonSoftware.exe N/A |
| 0 N/A N/A 35960 C+G ...oogle\Chrome\Application\chrome.exe N/A |
| 0 N/A N/A 36928 C+G ...t.LockApp_cw5n1h2txyewy\LockApp.exe N/A |
| 0 N/A N/A 38648 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 41612 C+G ...5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 N/A N/A 41720 C+G ...CBS_cw5n1h2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 42312 C+G ...ekyb3d8bbwe\PhoneExperienceHost.exe N/A |
| 0 N/A N/A 44232 C+G C:\Windows\System32\ShellHost.exe N/A |
| 0 N/A N/A 45884 C+G ...s\System32\ApplicationFrameHost.exe N/A |
| 0 N/A N/A 49344 C+G ...Programs\Microsoft VS Code\Code.exe N/A |
| 0 N/A N/A 51312 C+G ...__8wekyb3d8bbwe\WindowsTerminal.exe N/A |
+-----------------------------------------------------------------------------------------+
@rick-github commented on GitHub (Jan 7, 2025):
Unfortunately this is not a very useful output: it doesn't contain the VRAM usage, and the process names are incomplete, so it's not possible to identify the large users of VRAM. But there may be low-hanging fruit. Does your machine have an integrated graphics processor? If so, it may be possible to set it as the default GPU for the system in the BIOS, so that when Windows starts it doesn't allocate VRAM from the 4070. The alternative is to set the preferred GPU on a program-by-program basis, as discussed here.
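As an aside (not from the thread): the ollama runner is a compute (C) process, so while it is active its VRAM use can usually be listed with an `nvidia-smi` query like the sketch below. The C+G graphics processes in the table above generally need Task Manager's Details view with the "Dedicated GPU memory" column added instead.

```
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```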
@blueApple12 commented on GitHub (Jan 7, 2025):
I completely disabled the integrated graphics. Could that cause this? I did it because I thought ollama might use my integrated graphics instead of my GPU.
@rick-github commented on GitHub (Jan 7, 2025):
ollama will not use integrated graphics; there is very little support for those types of GPUs. Enable it and make it the default.
@kreier commented on GitHub (Jan 9, 2025):
Your RAM should be sufficient. This is really strange. I found conflicting statements about your available VRAM in your logfile https://github.com/ollama/ollama/issues/8310#issuecomment-2573402746 just a second apart:
I observed a similar behavior to your 4060 with two GTX 1060 6GB. Starting `llama3.2-vision` runs 100% on the CPU (check with `ollama ps` after leaving with `/bye`). Then I started a similar-sized model, `phi4`, and it went 100% to the GPU, split between both graphics cards, and used 11GB. I tried a few others, and the vision model was the outlier.

Can you try other models (like phi4) that should fit into your VRAM, to narrow down this behavior? You have more than 10 GB free, and `llama3.2-vision` usually needs a little more than 9 GB, even though the files are only 7.9 GB large. And even when it can't fit completely into the VRAM, it should split some layers off and process them in regular RAM. With my 8 GB card under Windows I get the following result after running ollama, exiting it and calling `ollama ps`:

I checked my logfile, and got a statement `layers.offload=7` where you got a zero. I don't know the reason yet:

@kreier commented on GitHub (Jan 9, 2025):
A similar behavior was noticed with 6GB VRAM graphics cards in November 2024: https://github.com/ollama/ollama/issues/7509 It works with my 8GB card and the problem described here is for more than 10 GB available VRAM.
@rick-github commented on GitHub (Jan 10, 2025):
It's sufficient if there are no other processes using the GPU. Switching to integrated graphics will help.
These are 67 seconds apart.
Vision models have extra requirements that make it harder to fit them in limited VRAM as discussed in https://github.com/ollama/ollama/issues/7509#issuecomment-2457887328.
@kreier commented on GitHub (Jan 11, 2025):
Thanks @rick-github for the feedback and double-checking my comment. Sorry for the mistake, I should learn how to read the time!
I tested this scenario again, and I'm not sure if llama3.2-vision will fit entirely into 12GB of VRAM. The use of the integrated graphics might be the only way, as pointed out by others above.
First I tried to run llama3.2-vision just on the CPU. To do this I set the parameter `/set parameter num_gpu 0` after starting ollama, and then gave it a prompt to process. I checked the RAM usage afterwards with `ollama ps` and got a result of 11 GB, which is less than 12GB, so a 3060 with 12GB might work. Surprisingly, when using the 8 GB GPU partially, the speed went down from 5.4 token/s to 4.5 token/s. The stated utilization from `ollama ps` was 43%/57% CPU/GPU, but I think this only relates to RAM, not token generation speed. The GPU seems to be used only for the projector (see below) and the token generation is done entirely by the CPU.

On another system with an 8GB card and a 6GB card I got llama3.2-vision almost entirely into the VRAM; just 4% was still processed by the CPU. It resulted in 15 token/s. Following the advice given in this thread I switched to the iGPU of my processor and gained a few megabytes on the larger card, and finally got 100% GPU utilization. The responsiveness increased by 55% to 23.3 token/s! That's the reward for having all layers in the fast GPU memory!
Here I checked the combined utilization of the GPUs with `nvtop`. The larger used 6.937 GiB and the smaller 4.583 GiB; combined this equals 11.52 GiB. There is not much space left if this should fit into a 12 GB card. `ollama ps` even reported 13 GB RAM used. The distribution of processing power was heavily skewed: the big 8GB card used only 12% of the GPU power, while the smaller 6GB card got up to 84%.

One thing I still don't understand is how the memory requirements for the projector combine to something very close to 8GB, so any system with graphics cards smaller than 8GB might not even split the model to use the combined VRAM. It was already stated that the vision model is unique in this regard and needs one continuous chunk of RAM to operate. The logfile states:
I can't see how 1.8 + 2.8 adds up to something like 6.837 GiB, even if I add the 656 MiB for KV. Can someone explain the math to me? When using the system with only one 8 GB card, the logfile (see above) states that only 7 of the 41 layers were offloaded to the GPU:
This seems to be the "minimum pieces of the model that have to be loaded in VRAM in their entirety for anything to run on the GPU" that @jessegross mentioned in issue 7509 on November 6, 2024. https://github.com/ollama/ollama/issues/7509#issuecomment-2457887328
@blueApple12 commented on GitHub (Jan 15, 2025):
llama3.2-vision:latest 085a1fdae525 11 GB 100% CPU 4 minutes from now
I ran `/set parameter num_gpu 0`. It doesn't work. Why?
@rick-github commented on GitHub (Jan 15, 2025):
`num_gpu:0` means load 0 layers into the GPU.
@kreier commented on GitHub (Jan 16, 2025):
It actually works as intended. It sets the number of layers offloaded to the GPU to zero and runs entirely on the CPU, and `ollama ps` just confirmed that all is done on the CPU. I ran this to see the minimum required contiguous memory to run llama3.2-vision. The result of 11GB is lower than the VRAM of your GPU with 12 GB, so it might fit. If some layers are split to run on the GPU and some on the CPU, the total memory demand increases, here usually to 13 GB.

But with your 12 GB card at least some layers would be processed by the GPU if at least 8 GB are available. If you don't run a game in the background this should be possible. Can you try again to close all applications and just run ollama, to see if at least a few (maybe the first 7) layers will be offloaded to the GPU (needing 8 GB)? Or if you connect your monitor to the iGPU? Then it could be possible to run the complete llama3.2-vision on the GPU, at least according to my calculations.
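For reference, a minimal sketch of this CPU-only test (the `ollama ps` line mirrors the one quoted above; the prompt is just a placeholder):

```
$ ollama run llama3.2-vision
>>> /set parameter num_gpu 0
>>> Why is the sky blue?
...
>>> /bye

$ ollama ps
NAME                     ID              SIZE     PROCESSOR    UNTIL
llama3.2-vision:latest   085a1fdae525    11 GB    100% CPU     4 minutes from now
```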
@blueApple12 commented on GitHub (Jan 16, 2025):
I don't want this to be entirely on the GPU, but when I run it normally it still doesn't utilize my GPU.
@rick-github commented on GitHub (Jan 16, 2025):
Have you switched to iGPU? Can you supply server logs?
@kreier commented on GitHub (Jan 16, 2025):
Can you be a little more specific? My card has only 8GB and I'm using Windows, too. First I check the free VRAM with `nvidia-smi`: only 950 MiB are used by the system for Chrome, YouTube videos, etc.

Then I start a regular `ollama run llama3.2-vision` and have a conversation. Leaving with `/bye` and checking utilization and VRAM again: now some 6920 MiB of the GPU VRAM are used, the 52% listed by ollama after `ollama ps`. What are your values?
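As an aside, a compact way to watch just the memory numbers during such a test (a sketch; the `--query-gpu` properties and the `-l` repeat interval are standard `nvidia-smi` flags):

```
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv -l 5
```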
ollama ps. What are your values?@blueApple12 commented on GitHub (Jan 17, 2025):
What is the recommended `num_gpu` for this?
@rick-github commented on GitHub (Jan 17, 2025):
ollama will compute `num_gpu` and show it in the log; search for `layers.offload`. You can override this in the API call or Modelfile if you think ollama is wrong (see the sketch below). The maximum value is `layers.model`.
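A hedged illustration of the Modelfile route (standard `FROM`/`PARAMETER` syntax; the 41-layer count comes from this thread, and the model name `llama3.2-vision-gpu` is just an example):

```
# Modelfile: request all 41 layers on the GPU
# (any value >= the model's layer count has the same effect, per the thread)
FROM llama3.2-vision
PARAMETER num_gpu 41
```

Build and run it with `ollama create llama3.2-vision-gpu -f Modelfile` followed by `ollama run llama3.2-vision-gpu`.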
@kreier commented on GitHub (Jan 17, 2025):
Going over the logfile you posted https://github.com/ollama/ollama/issues/8310#issuecomment-2573402746 it looks like your GPU was utilized running `llama3.2-vision` and `llama3.2` 3B instruct. Here are a few timestamps and excerpts:

It looks like you were going back and forth between the larger `Llama-3.2-11B-Vision-Instruct` model and the smaller `Llama 3.2 3B Instruct` model. And as the logfile shows, all 29 layers of the smaller model were offloaded into the GPU. If you had checked with `ollama ps` you would have gotten 100% GPU while using 2.9 GiB.

As for the vision model, depending on the available VRAM it was partially loaded into your GPU in some instances:

The last one was close! Freeing your VRAM might have fit all 41 layers. Or use of the iGPU. Interestingly, if the model is split between GPU and CPU, the split parameter states `layers.split=""`; only when splitting between several GPUs do you get the distribution.

So the issue you posted here seems to have applied only at 2025-01-05T16:44:53.336+02:00 and 2025-01-05T17:32:29.130+02:00, when your GPU was running out of VRAM. But ollama was using your GPU before and after that.