Mirror of https://github.com/ollama/ollama.git
Closed · opened 2026-04-28 17:22:59 -05:00 by GiteaMirror · 11 comments
Originally created by @Blake110 on GitHub (Sep 24, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6930
Originally assigned to: @dhiltgen on GitHub.
What is the issue?
With a P40 and an M6000 installed together, only the P40 works; the M6000's memory is not used by Ollama, even after modifying ollama.service for multi-GPU.
I tried the P40 with a 1080 Ti, and it works fine with the default ollama.service. The P40 with an RTX 2060 also works fine with the default ollama.service.
Can anyone tell me why, and is there a chance to make them work together? Thanks.
OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: 0.3.11
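For context, "modified ollama.service for multi GPU" usually takes the form of a systemd drop-in setting the scheduling and device-visibility variables Ollama reads; both appear in the server config dump later in this thread. A minimal sketch, assuming the default unit name:

# /etc/systemd/system/ollama.service.d/multi-gpu.conf
[Service]
# Spread a single model across all detected GPUs instead of packing one card
Environment="OLLAMA_SCHED_SPREAD=1"
# Make both cards visible to the runner; ordinals come from nvidia-smi -L
Environment="CUDA_VISIBLE_DEVICES=0,1"

Apply with systemctl daemon-reload && systemctl restart ollama.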
@rick-github commented on GitHub (Sep 24, 2024):
Server logs may help in debugging.
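On a systemd install like this one (the unit is ollama.service, per the logs that follow), the full server log can be captured with journalctl, for example:

journalctl -u ollama --no-pager > ollama.log

The GPU discovery lines ("looking for compatible GPUs", "inference compute") appear shortly after startup.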
@Blake110 commented on GitHub (Sep 25, 2024):
Thanks for your reply. Logs below.
Sep 25 01:54:13 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:13 | 200 | 43.216µs | 127.0.0.1 | HEAD "/"
Sep 25 01:54:13 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:13 | 200 | 164.080988ms | 127.0.0.1 | GET "/api/tags"
Sep 25 01:54:21 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:21 | 200 | 22.7µs | 127.0.0.1 | HEAD "/"
Sep 25 01:54:21 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:21 | 200 | 83.877µs | 127.0.0.1 | GET "/api/ps"
Sep 25 01:54:49 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:49 | 200 | 24.045µs | 127.0.0.1 | HEAD "/"
Sep 25 01:54:49 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:54:49 | 200 | 1.135969ms | 127.0.0.1 | GET "/api/tags"
Sep 25 01:55:03 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:55:03 | 200 | 22.533µs | 127.0.0.1 | HEAD "/"
Sep 25 01:55:03 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:55:03 | 200 | 186.942878ms | 127.0.0.1 | POST "/api/show"
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.644-07:00 level=INFO source=server.go:103 msg="system memory" total="78.5 GiB" free="77.2 GiB" free_swap="8.0 GiB"
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.645-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2509469182/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 37095"
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.647-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.648-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] build info | build=10 commit="9225b05" tid="139687410753536" timestamp=1727254503
Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139687410753536" timestamp=1727254503 total_threads=14
Sep 25 01:55:03 ai-platform ollama[2194]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="37095" tid="139687410753536" timestamp=1727254503
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 1: general.type str = model
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 3: general.finetune str = Instruct
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 5: general.size_label str = 70B
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 6: general.license str = llama3.1
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 9: llama.block_count u32 = 80
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 17: general.file_type u32 = 2
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type f32: 162 tensors
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type q4_0: 561 tensors
Sep 25 01:55:03 ai-platform ollama[2049]: llama_model_loader: - type q6_K: 1 tensors
Sep 25 01:55:03 ai-platform ollama[2049]: time=2024-09-25T01:55:03.899-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_vocab: special tokens cache size = 256
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: arch = llama
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: vocab type = BPE
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_vocab = 128256
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_merges = 280147
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: vocab_only = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ctx_train = 131072
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd = 8192
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_layer = 80
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_head = 64
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_head_kv = 8
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_rot = 128
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_swa = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_head_k = 128
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_head_v = 128
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_gqa = 8
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_k_gqa = 1024
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_embd_v_gqa = 1024
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ff = 28672
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_expert = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_expert_used = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: causal attn = 1
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: pooling type = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope type = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope scaling = linear
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: freq_base_train = 500000.0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: freq_scale_train = 1
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: rope_finetuned = unknown
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_conv = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_inner = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_d_state = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_dt_rank = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model type = 70B
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model ftype = Q4_0
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model params = 70.55 B
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW)
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: LF token = 128 'Ä'
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: max token length = 256
Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: found 1 CUDA devices:
Sep 25 01:55:04 ai-platform ollama[2049]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_tensors: ggml ctx size = 0.68 MiB
Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: offloading 47 repeating layers to GPU
Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: offloaded 47/81 layers to GPU
Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: CPU buffer size = 38110.61 MiB
Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_ctx = 2048
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_batch = 512
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_ubatch = 512
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: flash_attn = 0
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: freq_base = 500000.0
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: freq_scale = 1
Sep 25 01:57:36 ai-platform ollama[2049]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: graph nodes = 2566
Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: graph splits = 433
Sep 25 01:57:37 ai-platform ollama[2194]: INFO [main] model loaded | tid="139687410753536" timestamp=1727254657
Sep 25 01:57:37 ai-platform ollama[2049]: time=2024-09-25T01:57:37.386-07:00 level=INFO source=server.go:626 msg="llama runner started in 153.74 seconds"
Sep 25 01:57:37 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:57:37 | 200 | 2m34s | 127.0.0.1 | POST "/api/generate"
@Blake110 commented on GitHub (Sep 25, 2024):
And here's a screenshot of nvtop. I loaded llama3.1:70b.

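The same view is available as text with nvidia-smi, which is easier to quote than a screenshot; per-GPU memory use can be listed with:

nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv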
@rick-github commented on GitHub (Sep 25, 2024):
Please post the full log; there is information earlier in the log that shows device detection.
@Blake110 commented on GitHub (Sep 25, 2024):
All of the Ollama logs are here. BTW, after pulling out the P40, the M6000 works with Ollama under the same NVIDIA CUDA driver.
I also tried the RTX 2060 with the P40, and the 1080 Ti with the P40, under the same driver; they work together with no issues. So differing GPU architectures alone should be OK.
I used the NVIDIA-Linux-x86_64-550.100.run driver.
In lines 352-354 (also lines 494-496), 2 GPUs were found:
352 Sep 25 13:09:20 ai-platform ollama[863]: time=2024-09-25T13:09:20.795-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
353 Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
354 Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
And in lines 719-720, only the P40 was found by ggml_cuda:
719 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices:
720 Sep 25 13:57:37 ai-platform ollama[863]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
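Note the variant split in those detection lines: the P40 is reported with library=cuda variant=v12, the M6000 with variant=v11, and the runner that was started came from runners/cuda_v12. One way to test whether that split is what hides the M6000 (a diagnostic sketch, not a confirmed fix) is to force the v11 runner, which the "Dynamic LLM libraries" line lists as available:

# force the cuda_v11 runner for one run, then load a model and re-check ggml_cuda_init
OLLAMA_LLM_LIBRARY=cuda_v11 ollama serve

If ggml_cuda_init then reports 2 CUDA devices, the per-variant device split is the cause.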
100 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
101 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: LF token = 128 'Ä'
102 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
103 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_print_meta: max token length = 256
104 Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
105 Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
106 Sep 25 01:55:04 ai-platform ollama[2049]: ggml_cuda_init: found 1 CUDA devices:
107 Sep 25 01:55:04 ai-platform ollama[2049]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
108 Sep 25 01:55:04 ai-platform ollama[2049]: llm_load_tensors: ggml ctx size = 0.68 MiB
109 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: offloading 47 repeating layers to GPU
110 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: offloaded 47/81 layers to GPU
111 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: CPU buffer size = 38110.61 MiB
112 Sep 25 01:57:33 ai-platform ollama[2049]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB
113 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_ctx = 2048
114 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_batch = 512
115 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: n_ubatch = 512
116 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: flash_attn = 0
117 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: freq_base = 500000.0
118 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: freq_scale = 1
119 Sep 25 01:57:36 ai-platform ollama[2049]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB
120 Sep 25 01:57:36 ai-platform ollama[2049]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB
121 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
122 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
123 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
124 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
125 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: graph nodes = 2566
126 Sep 25 01:57:36 ai-platform ollama[2049]: llama_new_context_with_model: graph splits = 433
127 Sep 25 01:57:37 ai-platform ollama[2194]: INFO [main] model loaded | tid="139687410753536" timestamp=1727254657
128 Sep 25 01:57:37 ai-platform ollama[2049]: time=2024-09-25T01:57:37.386-07:00 level=INFO source=server.go:626 msg="llama runner started in 153.74 seconds"
129 Sep 25 01:57:37 ai-platform ollama[2049]: [GIN] 2024/09/25 - 01:57:37 | 200 | 2m34s | 127.0.0.1 | POST "/api/generate"
130 Sep 25 02:08:35 ai-platform systemd[1]: Stopping Ollama Service...
131 Sep 25 02:08:36 ai-platform systemd[1]: ollama.service: Deactivated successfully.
132 Sep 25 02:08:36 ai-platform systemd[1]: Stopped Ollama Service.
133 Sep 25 02:08:36 ai-platform systemd[1]: ollama.service: Consumed 1min 4.919s CPU time.
134 -- Boot 997666db52994643b5bfc4ed04149e37 --
135 Sep 25 02:10:03 ai-platform systemd[1]: Started Ollama Service.
136 Sep 25 02:10:03 ai-platform ollama[846]: 2024/09/25 02:10:03 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
137 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.720-07:00 level=INFO source=images.go:753 msg="total blobs: 33"
138 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.729-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
139 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.730-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
140 Sep 25 02:10:03 ai-platform ollama[846]: time=2024-09-25T02:10:03.732-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama4199965130/runners
141 Sep 25 02:10:19 ai-platform ollama[846]: time=2024-09-25T02:10:19.610-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
142 Sep 25 02:10:19 ai-platform ollama[846]: time=2024-09-25T02:10:19.612-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
143 Sep 25 02:10:21 ai-platform ollama[846]: time=2024-09-25T02:10:21.299-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
144 Sep 25 02:10:21 ai-platform ollama[846]: time=2024-09-25T02:10:21.299-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
145 Sep 25 02:12:57 ai-platform ollama[846]: [GIN] 2024/09/25 - 02:12:57 | 200 | 996.696µs | 127.0.0.1 | HEAD "/"
146 Sep 25 02:12:57 ai-platform ollama[846]: [GIN] 2024/09/25 - 02:12:57 | 200 | 180.045µs | 127.0.0.1 | GET "/api/ps"
147 Sep 25 02:13:13 ai-platform systemd[1]: Stopping Ollama Service...
148 Sep 25 02:13:14 ai-platform systemd[1]: ollama.service: Deactivated successfully.
149 Sep 25 02:13:14 ai-platform systemd[1]: Stopped Ollama Service.
150 Sep 25 02:13:14 ai-platform systemd[1]: ollama.service: Consumed 30.152s CPU time.
151 -- Boot 37f48c8c97cd44a0bb24888eb055fc69 --
152 Sep 25 02:17:12 ai-platform systemd[1]: Started Ollama Service.
153 Sep 25 02:17:15 ai-platform ollama[869]: 2024/09/25 02:17:15 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
154 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.092-07:00 level=INFO source=images.go:753 msg="total blobs: 33"
155 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.165-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
156 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.170-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
157 Sep 25 02:17:15 ai-platform ollama[869]: time=2024-09-25T02:17:15.171-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama4047483176/runners
158 Sep 25 02:17:59 ai-platform ollama[869]: time=2024-09-25T02:17:59.819-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu]"
159 Sep 25 02:17:59 ai-platform ollama[869]: time=2024-09-25T02:17:59.820-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
160 Sep 25 02:18:01 ai-platform ollama[869]: time=2024-09-25T02:18:01.531-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
161 Sep 25 02:18:01 ai-platform ollama[869]: time=2024-09-25T02:18:01.532-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
162 Sep 25 02:23:45 ai-platform systemd[1]: Stopping Ollama Service...
163 Sep 25 02:23:46 ai-platform systemd[1]: ollama.service: Deactivated successfully.
164 Sep 25 02:23:46 ai-platform systemd[1]: Stopped Ollama Service.
165 Sep 25 02:23:46 ai-platform systemd[1]: ollama.service: Consumed 30.855s CPU time.
166 -- Boot 573ea622850a4a3d8eb2de36dd38cae3 --
167 Sep 25 02:25:05 ai-platform systemd[1]: Started Ollama Service.
168 Sep 25 02:25:07 ai-platform ollama[864]: 2024/09/25 02:25:07 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
169 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.064-07:00 level=INFO source=images.go:753 msg="total blobs: 33"
170 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.803-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
171 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.810-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
172 Sep 25 02:25:08 ai-platform ollama[864]: time=2024-09-25T02:25:08.817-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2283343905/runners
173 Sep 25 02:25:51 ai-platform ollama[864]: time=2024-09-25T02:25:51.063-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu cpu_avx]"
174 Sep 25 02:25:51 ai-platform ollama[864]: time=2024-09-25T02:25:51.070-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
175 Sep 25 02:25:52 ai-platform ollama[864]: time=2024-09-25T02:25:52.804-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
176 Sep 25 02:25:52 ai-platform ollama[864]: time=2024-09-25T02:25:52.804-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
177 Sep 25 02:49:48 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:49:48 | 200 | 806.784µs | 127.0.0.1 | HEAD "/"
178 Sep 25 02:49:48 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:49:48 | 200 | 14.434265ms | 127.0.0.1 | GET "/api/tags"
179 Sep 25 02:55:02 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:02 | 200 | 37.488µs | 127.0.0.1 | HEAD "/"
180 Sep 25 02:55:02 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:02 | 200 | 79.297887ms | 127.0.0.1 | DELETE "/api/delete"
181 Sep 25 02:55:07 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:07 | 200 | 35.511µs | 127.0.0.1 | HEAD "/"
182 Sep 25 02:55:07 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:07 | 200 | 1.096253ms | 127.0.0.1 | GET "/api/tags"
183 Sep 25 02:55:25 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:55:25 | 200 | 24.964µs | 127.0.0.1 | HEAD "/"
184 Sep 25 02:55:26 ai-platform ollama[864]: time=2024-09-25T02:55:26.794-07:00 level=INFO source=download.go:175 msg="downloading 09cd6813dc2e in 17 1 GB part(s)"
185 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 10 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
186 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 15 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
187 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 9 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
188 Sep 25 02:55:52 ai-platform ollama[864]: time=2024-09-25T02:55:52.901-07:00 level=INFO source=download.go:370 msg="09cd6813dc2e part 7 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
189 Sep 25 02:58:03 ai-platform ollama[864]: time=2024-09-25T02:58:03.626-07:00 level=INFO source=download.go:175 msg="downloading 948af2743fc7 in 1 1.5 KB part(s)"
190 Sep 25 02:58:05 ai-platform ollama[864]: time=2024-09-25T02:58:05.536-07:00 level=INFO source=download.go:175 msg="downloading daa7d15f6d0b in 1 484 B part(s)"
191 Sep 25 02:58:53 ai-platform ollama[864]: [GIN] 2024/09/25 - 02:58:53 | 200 | 3m28s | 127.0.0.1 | POST "/api/pull"
192 Sep 25 03:00:30 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:30 | 200 | 28.578µs | 127.0.0.1 | HEAD "/"
193 Sep 25 03:00:30 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:30 | 200 | 1.274784ms | 127.0.0.1 | GET "/api/tags"
194 Sep 25 03:00:49 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:49 | 200 | 22.555µs | 127.0.0.1 | HEAD "/"
195 Sep 25 03:00:49 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:49 | 404 | 135.621µs | 127.0.0.1 | POST "/api/show"
196 Sep 25 03:00:50 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:00:50 | 200 | 468.40178ms | 127.0.0.1 | POST "/api/pull"
197 Sep 25 03:01:06 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:06 | 200 | 27.667µs | 127.0.0.1 | HEAD "/"
198 Sep 25 03:01:06 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:06 | 200 | 26.493541ms | 127.0.0.1 | POST "/api/show"
199 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.425-07:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-09cd6813dc2e73d9db9345123ee1b3385bb7cee88a46f13dc37bc3d5e96ba3a4 gpu=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea parallel=4 available=25470566400 required="16.4 GiB"
200 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.425-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.5 GiB" free_swap="8.0 GiB"
201 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.426-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="16.4 GiB" memory.required.partial="16.4 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[16.4 GiB]" memory.weights.total="14.0 GiB" memory.weights.repeating="13.0 GiB" memory.weights.nonrepeating="1002.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
202 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.428-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2283343905/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-09cd6813dc2e73d9db9345123ee1b3385bb7cee88a46f13dc37bc3d5e96ba3a4 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 34317"
203 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.429-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
204 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.429-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
205 Sep 25 03:01:06 ai-platform ollama[864]: time=2024-09-25T03:01:06.429-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
206 Sep 25 03:01:07 ai-platform ollama[1805]: INFO [main] build info | build=10 commit="9225b05" tid="140325808095232" timestamp=1727258467
207 Sep 25 03:01:07 ai-platform ollama[1805]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140325808095232" timestamp=1727258467 total_threads=14
208 Sep 25 03:01:07 ai-platform ollama[1805]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="34317" tid="140325808095232" timestamp=1727258467
209 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-09cd6813dc2e73d9db9345123ee1b3385bb7cee88a46f13dc37bc3d5e96ba3a4 (version GGUF V3 (latest))
210 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
211 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 0: general.architecture str = llama
212 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 1: general.type str = model
213 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
214 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 3: general.finetune str = Instruct
215 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
216 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 5: general.size_label str = 8B
217 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 6: general.license str = llama3.1
218 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
219 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
220 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 9: llama.block_count u32 = 32
221 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
222 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
223 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
224 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
225 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
226 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
227 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
228 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 17: general.file_type u32 = 1
229 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
230 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
231 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
232 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
233 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
234 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
235 Sep 25 03:01:07 ai-platform ollama[864]: time=2024-09-25T03:01:07.685-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
236 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
237 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
238 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
239 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
240 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
241 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - type f32: 66 tensors
242 Sep 25 03:01:07 ai-platform ollama[864]: llama_model_loader: - type f16: 226 tensors
243 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_vocab: special tokens cache size = 256
244 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_vocab: token to piece cache size = 0.7999 MB
245 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: format = GGUF V3 (latest)
246 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: arch = llama
247 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: vocab type = BPE
248 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_vocab = 128256
249 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_merges = 280147
250 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: vocab_only = 0
251 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_ctx_train = 131072
252 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd = 4096
253 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_layer = 32
254 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_head = 32
255 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_head_kv = 8
256 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_rot = 128
257 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_swa = 0
258 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_head_k = 128
259 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_head_v = 128
260 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_gqa = 4
261 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_k_gqa = 1024
262 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_embd_v_gqa = 1024
263 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_norm_eps = 0.0e+00
264 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
265 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
266 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
267 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: f_logit_scale = 0.0e+00
268 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_ff = 14336
269 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_expert = 0
270 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_expert_used = 0
271 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: causal attn = 1
272 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: pooling type = 0
273 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: rope type = 0
274 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: rope scaling = linear
275 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: freq_base_train = 500000.0
276 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: freq_scale_train = 1
277 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: n_ctx_orig_yarn = 131072
278 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: rope_finetuned = unknown
279 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_d_conv = 0
280 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_d_inner = 0
281 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_d_state = 0
282 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_dt_rank = 0
283 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: ssm_dt_b_c_rms = 0
284 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model type = 8B
285 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model ftype = F16
286 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model params = 8.03 B
287 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: model size = 14.96 GiB (16.00 BPW)
288 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
289 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
290 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
291 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: LF token = 128 'Ä'
292 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
293 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_print_meta: max token length = 256
294 Sep 25 03:01:08 ai-platform ollama[864]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
295 Sep 25 03:01:08 ai-platform ollama[864]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
296 Sep 25 03:01:08 ai-platform ollama[864]: ggml_cuda_init: found 1 CUDA devices:
297 Sep 25 03:01:08 ai-platform ollama[864]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
298 Sep 25 03:01:08 ai-platform ollama[864]: llm_load_tensors: ggml ctx size = 0.27 MiB
299 Sep 25 03:01:09 ai-platform ollama[864]: time=2024-09-25T03:01:09.140-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
300 Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: offloading 32 repeating layers to GPU
301 Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: offloading non-repeating layers to GPU
302 Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: offloaded 33/33 layers to GPU
303 Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: CPU buffer size = 1002.00 MiB
304 Sep 25 03:01:09 ai-platform ollama[864]: llm_load_tensors: CUDA0 buffer size = 14315.02 MiB
305 Sep 25 03:01:09 ai-platform ollama[864]: time=2024-09-25T03:01:09.843-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
306 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: n_ctx = 8192
307 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: n_batch = 512
308 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: n_ubatch = 512
309 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: flash_attn = 0
310 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: freq_base = 500000.0
311 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: freq_scale = 1
312 Sep 25 03:01:11 ai-platform ollama[864]: llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
313 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
314 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB
315 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
316 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
317 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: graph nodes = 1030
318 Sep 25 03:01:11 ai-platform ollama[864]: llama_new_context_with_model: graph splits = 2
319 Sep 25 03:01:12 ai-platform ollama[1805]: INFO [main] model loaded | tid="140325808095232" timestamp=1727258472
320 Sep 25 03:01:12 ai-platform ollama[864]: time=2024-09-25T03:01:12.354-07:00 level=INFO source=server.go:626 msg="llama runner started in 5.93 seconds"
321 Sep 25 03:01:12 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:12 | 200 | 6.266740535s | 127.0.0.1 | POST "/api/generate"
322 Sep 25 03:01:26 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:01:26 | 200 | 1.650481207s | 127.0.0.1 | POST "/api/chat"
323 Sep 25 03:02:29 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:02:29 | 200 | 27.755757066s | 127.0.0.1 | POST "/api/chat"
324 Sep 25 03:05:31 ai-platform ollama[864]: [GIN] 2024/09/25 - 03:05:31 | 200 | 34.507382868s | 127.0.0.1 | POST "/api/chat"
325 Sep 25 06:26:33 ai-platform systemd[1]: Stopping Ollama Service...
326 Sep 25 06:26:34 ai-platform systemd[1]: ollama.service: Deactivated successfully.
327 Sep 25 06:26:34 ai-platform systemd[1]: Stopped Ollama Service.
328 Sep 25 06:26:34 ai-platform systemd[1]: ollama.service: Consumed 6min 21.822s CPU time.
329 -- Boot 07e33ef45ce6476f8795bb10410b0122 --
330 Sep 25 12:42:32 ai-platform systemd[1]: Started Ollama Service.
331 Sep 25 12:42:32 ai-platform ollama[861]: 2024/09/25 12:42:32 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
332 Sep 25 12:42:32 ai-platform ollama[861]: time=2024-09-25T12:42:32.988-07:00 level=INFO source=images.go:753 msg="total blobs: 31"
333 Sep 25 12:42:32 ai-platform ollama[861]: time=2024-09-25T12:42:32.997-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
334 Sep 25 12:42:32 ai-platform ollama[861]: time=2024-09-25T12:42:32.999-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
335 Sep 25 12:42:33 ai-platform ollama[861]: time=2024-09-25T12:42:33.003-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama1762789208/runners
336 Sep 25 12:42:48 ai-platform ollama[861]: time=2024-09-25T12:42:48.061-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
337 Sep 25 12:42:48 ai-platform ollama[861]: time=2024-09-25T12:42:48.069-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
338 Sep 25 12:42:50 ai-platform ollama[861]: time=2024-09-25T12:42:50.306-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
339 Sep 25 12:42:50 ai-platform ollama[861]: time=2024-09-25T12:42:50.306-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
340 Sep 25 13:07:38 ai-platform systemd[1]: Stopping Ollama Service...
341 Sep 25 13:07:40 ai-platform systemd[1]: ollama.service: Deactivated successfully.
342 Sep 25 13:07:40 ai-platform systemd[1]: Stopped Ollama Service.
343 Sep 25 13:07:40 ai-platform systemd[1]: ollama.service: Consumed 29.339s CPU time.
344 -- Boot 24c5cad9e4db4be8951d9cf2bc3114c5 --
345 Sep 25 13:08:55 ai-platform systemd[1]: Started Ollama Service.
346 Sep 25 13:08:59 ai-platform ollama[863]: 2024/09/25 13:08:59 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
347 Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.497-07:00 level=INFO source=images.go:753 msg="total blobs: 31"
348 Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.546-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
349 Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.547-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
350 Sep 25 13:08:59 ai-platform ollama[863]: time=2024-09-25T13:08:59.547-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2555802688/runners
351 Sep 25 13:09:20 ai-platform ollama[863]: time=2024-09-25T13:09:20.794-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu]"
352 Sep 25 13:09:20 ai-platform ollama[863]: time=2024-09-25T13:09:20.795-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
353 Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
354 Sep 25 13:09:22 ai-platform ollama[863]: time=2024-09-25T13:09:22.492-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
355 Sep 25 13:15:35 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:35 | 200 | 28.717µs | 127.0.0.1 | HEAD "/"
356 Sep 25 13:15:35 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:35 | 200 | 11.959577ms | 127.0.0.1 | GET "/api/tags"
357 Sep 25 13:15:57 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:57 | 200 | 24.363µs | 127.0.0.1 | HEAD "/"
358 Sep 25 13:15:57 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:15:57 | 200 | 160.361228ms | 127.0.0.1 | POST "/api/show"
359 Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.207-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.6 GiB" free_swap="8.0 GiB"
360 Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.208-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
361 Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.210-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2555802688/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 33157"
362 Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.211-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
363 Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.211-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
364 Sep 25 13:15:58 ai-platform ollama[863]: time=2024-09-25T13:15:58.211-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
365 Sep 25 13:15:59 ai-platform ollama[1836]: INFO [main] build info | build=10 commit="9225b05" tid="139726698635264" timestamp=1727295359
366 Sep 25 13:15:59 ai-platform ollama[1836]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139726698635264" timestamp=1727295359 total_threads=14
367 Sep 25 13:15:59 ai-platform ollama[1836]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="33157" tid="139726698635264" timestamp=1727295359
368 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
369 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
370 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 0: general.architecture str = llama
371 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 1: general.type str = model
372 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
373 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 3: general.finetune str = Instruct
374 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
375 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 5: general.size_label str = 70B
376 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 6: general.license str = llama3.1
377 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
378 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
379 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 9: llama.block_count u32 = 80
380 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
381 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
382 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
383 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
384 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
385 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
386 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
387 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 17: general.file_type u32 = 2
388 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
389 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
390 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
391 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
392 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
393 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
394 Sep 25 13:15:59 ai-platform ollama[863]: time=2024-09-25T13:15:59.717-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
395 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
396 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
397 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
398 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
399 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
400 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - type f32: 162 tensors
401 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - type q4_0: 561 tensors
402 Sep 25 13:15:59 ai-platform ollama[863]: llama_model_loader: - type q6_K: 1 tensors
403 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_vocab: special tokens cache size = 256
404 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_vocab: token to piece cache size = 0.7999 MB
405 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: format = GGUF V3 (latest)
406 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: arch = llama
407 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: vocab type = BPE
408 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_vocab = 128256
409 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_merges = 280147
410 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: vocab_only = 0
411 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_ctx_train = 131072
412 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd = 8192
413 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_layer = 80
414 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_head = 64
415 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_head_kv = 8
416 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_rot = 128
417 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_swa = 0
418 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_k = 128
419 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_v = 128
420 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_gqa = 8
421 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_k_gqa = 1024
422 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_embd_v_gqa = 1024
423 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_norm_eps = 0.0e+00
424 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
425 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
426 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
427 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: f_logit_scale = 0.0e+00
428 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_ff = 28672
429 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_expert = 0
430 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_expert_used = 0
431 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: causal attn = 1
432 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: pooling type = 0
433 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: rope type = 0
434 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: rope scaling = linear
435 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: freq_base_train = 500000.0
436 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: freq_scale_train = 1
437 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: n_ctx_orig_yarn = 131072
438 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: rope_finetuned = unknown
439 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_d_conv = 0
440 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_d_inner = 0
441 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_d_state = 0
442 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_rank = 0
443 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_b_c_rms = 0
444 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model type = 70B
445 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model ftype = Q4_0
446 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model params = 70.55 B
447 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW)
448 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
449 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
450 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
451 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: LF token = 128 'Ä'
452 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
453 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_print_meta: max token length = 256
454 Sep 25 13:16:00 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
455 Sep 25 13:16:00 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
456 Sep 25 13:16:00 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices:
457 Sep 25 13:16:00 ai-platform ollama[863]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
458 Sep 25 13:16:00 ai-platform ollama[863]: llm_load_tensors: ggml ctx size = 0.68 MiB
459 Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: offloading 47 repeating layers to GPU
460 Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: offloaded 47/81 layers to GPU
461 Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: CPU buffer size = 38110.61 MiB
462 Sep 25 13:20:04 ai-platform ollama[863]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB
463 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: n_ctx = 2048
464 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: n_batch = 512
465 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: n_ubatch = 512
466 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: flash_attn = 0
467 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: freq_base = 500000.0
468 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: freq_scale = 1
469 Sep 25 13:20:06 ai-platform ollama[863]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB
470 Sep 25 13:20:06 ai-platform ollama[863]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB
471 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
472 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
473 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
474 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
475 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: graph nodes = 2566
476 Sep 25 13:20:06 ai-platform ollama[863]: llama_new_context_with_model: graph splits = 433
477 Sep 25 13:20:07 ai-platform ollama[1836]: INFO [main] model loaded | tid="139726698635264" timestamp=1727295607
478 Sep 25 13:20:07 ai-platform ollama[863]: time=2024-09-25T13:20:07.942-07:00 level=INFO source=server.go:626 msg="llama runner started in 249.73 seconds"
479 Sep 25 13:20:07 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:20:07 | 200 | 4m10s | 127.0.0.1 | POST "/api/generate"
480 Sep 25 13:21:49 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:21:49 | 200 | 8.160347761s | 127.0.0.1 | POST "/api/chat"
481 Sep 25 13:22:11 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:22:11 | 200 | 14.763904056s | 127.0.0.1 | POST "/api/chat"
482 Sep 25 13:27:04 ai-platform systemd[1]: Stopping Ollama Service...
483 Sep 25 13:27:07 ai-platform systemd[1]: ollama.service: Deactivated successfully.
484 Sep 25 13:27:07 ai-platform systemd[1]: Stopped Ollama Service.
485 Sep 25 13:27:07 ai-platform systemd[1]: ollama.service: Consumed 5min 33.070s CPU time.
486 -- Boot 16c6f123db1c41d89aa8afa1dcd6c4fc --
487 Sep 25 13:28:14 ai-platform systemd[1]: Started Ollama Service.
488 Sep 25 13:28:16 ai-platform ollama[863]: 2024/09/25 13:28:16 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
489 Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.609-07:00 level=INFO source=images.go:753 msg="total blobs: 31"
490 Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.666-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
491 Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.668-07:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
492 Sep 25 13:28:16 ai-platform ollama[863]: time=2024-09-25T13:28:16.668-07:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2431661027/runners
493 Sep 25 13:28:46 ai-platform ollama[863]: time=2024-09-25T13:28:46.470-07:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
494 Sep 25 13:28:46 ai-platform ollama[863]: time=2024-09-25T13:28:46.470-07:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
495 Sep 25 13:28:48 ai-platform ollama[863]: time=2024-09-25T13:28:48.635-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-8b16ac03-19ea-264b-44f5-0ba4e7a3cdea library=cuda variant=v12 compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
496 Sep 25 13:28:48 ai-platform ollama[863]: time=2024-09-25T13:28:48.635-07:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-9402de2e-20d7-ff58-45c7-f25025132ba7 library=cuda variant=v11 compute=5.2 driver=12.4 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
497 Sep 25 13:45:49 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:45:49 | 200 | 1.00897ms | 127.0.0.1 | HEAD "/"
498 Sep 25 13:45:49 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:45:49 | 200 | 14.162083ms | 127.0.0.1 | GET "/api/tags"
499 Sep 25 13:46:01 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:46:01 | 200 | 40.385µs | 127.0.0.1 | HEAD "/"
500 Sep 25 13:46:01 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:46:01 | 200 | 549.982929ms | 127.0.0.1 | POST "/api/show"
501 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.049-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.6 GiB" free_swap="8.0 GiB"
502 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.051-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
503 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.052-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2431661027/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 40727"
504 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.053-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
505 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.053-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
506 Sep 25 13:46:02 ai-platform ollama[863]: time=2024-09-25T13:46:02.053-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
507 Sep 25 13:46:03 ai-platform ollama[1842]: INFO [main] build info | build=10 commit="9225b05" tid="140313977905152" timestamp=1727297163
508 Sep 25 13:46:03 ai-platform ollama[1842]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140313977905152" timestamp=1727297163 total_threads=14
509 Sep 25 13:46:03 ai-platform ollama[1842]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="40727" tid="140313977905152" timestamp=1727297163
510 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
511 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
512 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 0: general.architecture str = llama
513 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 1: general.type str = model
514 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
515 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 3: general.finetune str = Instruct
516 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
517 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 5: general.size_label str = 70B
518 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 6: general.license str = llama3.1
519 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
520 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
521 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 9: llama.block_count u32 = 80
522 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
523 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
524 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
525 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
526 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
527 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
528 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
529 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 17: general.file_type u32 = 2
530 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
531 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
532 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
533 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
534 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
535 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
536 Sep 25 13:46:03 ai-platform ollama[863]: time=2024-09-25T13:46:03.308-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
537 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
538 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
539 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
540 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
541 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
542 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - type f32: 162 tensors
543 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - type q4_0: 561 tensors
544 Sep 25 13:46:03 ai-platform ollama[863]: llama_model_loader: - type q6_K: 1 tensors
545 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_vocab: special tokens cache size = 256
546 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_vocab: token to piece cache size = 0.7999 MB
547 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: format = GGUF V3 (latest)
548 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: arch = llama
549 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: vocab type = BPE
550 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_vocab = 128256
551 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_merges = 280147
552 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: vocab_only = 0
553 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_ctx_train = 131072
554 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd = 8192
555 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_layer = 80
556 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_head = 64
557 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_head_kv = 8
558 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_rot = 128
559 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_swa = 0
560 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_k = 128
561 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_v = 128
562 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_gqa = 8
563 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_k_gqa = 1024
564 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_embd_v_gqa = 1024
565 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_norm_eps = 0.0e+00
566 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
567 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
568 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
569 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: f_logit_scale = 0.0e+00
570 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_ff = 28672
571 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_expert = 0
572 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_expert_used = 0
573 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: causal attn = 1
574 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: pooling type = 0
575 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: rope type = 0
576 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: rope scaling = linear
577 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: freq_base_train = 500000.0
578 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: freq_scale_train = 1
579 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: n_ctx_orig_yarn = 131072
580 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: rope_finetuned = unknown
581 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_d_conv = 0
582 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_d_inner = 0
583 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_d_state = 0
584 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_rank = 0
585 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_b_c_rms = 0
586 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model type = 70B
587 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model ftype = Q4_0
588 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model params = 70.55 B
589 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW)
590 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
591 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
592 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
593 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: LF token = 128 'Ä'
594 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
595 Sep 25 13:46:03 ai-platform ollama[863]: llm_load_print_meta: max token length = 256
596 Sep 25 13:46:03 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
597 Sep 25 13:46:03 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
598 Sep 25 13:46:03 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices:
599 Sep 25 13:46:03 ai-platform ollama[863]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
600 Sep 25 13:46:04 ai-platform ollama[863]: llm_load_tensors: ggml ctx size = 0.68 MiB
601 Sep 25 13:48:53 ai-platform ollama[863]: llm_load_tensors: offloading 47 repeating layers to GPU
602 Sep 25 13:48:53 ai-platform ollama[863]: llm_load_tensors: offloaded 47/81 layers to GPU
603 Sep 25 13:48:53 ai-platform ollama[863]: llm_load_tensors: CPU buffer size = 38110.61 MiB
604 Sep 25 13:48:53 ai-platform ollama[863]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB
605 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: n_ctx = 2048
606 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: n_batch = 512
607 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: n_ubatch = 512
608 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: flash_attn = 0
609 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: freq_base = 500000.0
610 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: freq_scale = 1
611 Sep 25 13:48:56 ai-platform ollama[863]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB
612 Sep 25 13:48:56 ai-platform ollama[863]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB
613 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
614 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
615 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
616 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
617 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: graph nodes = 2566
618 Sep 25 13:48:56 ai-platform ollama[863]: llama_new_context_with_model: graph splits = 433
619 Sep 25 13:48:57 ai-platform ollama[1842]: INFO [main] model loaded | tid="140313977905152" timestamp=1727297337
620 Sep 25 13:48:57 ai-platform ollama[863]: time=2024-09-25T13:48:57.670-07:00 level=INFO source=server.go:626 msg="llama runner started in 175.62 seconds"
621 Sep 25 13:48:57 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:48:57 | 200 | 2m56s | 127.0.0.1 | POST "/api/generate"
622 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.808-07:00 level=INFO source=server.go:103 msg="system memory" total="62.8 GiB" free="61.6 GiB" free_swap="8.0 GiB"
623 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.809-07:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=47 layers.split="" memory.available="[23.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="23.4 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[23.4 GiB]" memory.weights.total="36.5 GiB" memory.weights.repeating="35.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
624 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2431661027/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 1 --port 44643"
625 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
626 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
627 Sep 25 13:57:36 ai-platform ollama[863]: time=2024-09-25T13:57:36.811-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
628 Sep 25 13:57:36 ai-platform ollama[2354]: INFO [main] build info | build=10 commit="9225b05" tid="139803936231424" timestamp=1727297856
629 Sep 25 13:57:36 ai-platform ollama[2354]: INFO [main] system info | n_threads=14 n_threads_batch=14 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139803936231424" timestamp=1727297856 total_threads=14
630 Sep 25 13:57:36 ai-platform ollama[2354]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="13" port="44643" tid="139803936231424" timestamp=1727297856
631 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6439716a5b6269ac02585fa4b90ab622c28d9fa8d93772cc713414642ffa6efd (version GGUF V3 (latest))
632 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
633 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 0: general.architecture str = llama
634 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 1: general.type str = model
635 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
636 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 3: general.finetune str = Instruct
637 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
638 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 5: general.size_label str = 70B
639 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 6: general.license str = llama3.1
640 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
641 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
642 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 9: llama.block_count u32 = 80
643 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
644 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
645 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
646 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
647 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
648 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
649 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
650 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 17: general.file_type u32 = 2
651 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
652 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
653 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
654 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
655 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
656 Sep 25 13:57:36 ai-platform ollama[863]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
657 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
658 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
659 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
660 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
661 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
662 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - type f32: 162 tensors
663 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - type q4_0: 561 tensors
664 Sep 25 13:57:37 ai-platform ollama[863]: llama_model_loader: - type q6_K: 1 tensors
665 Sep 25 13:57:37 ai-platform ollama[863]: time=2024-09-25T13:57:37.063-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
666 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_vocab: special tokens cache size = 256
667 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_vocab: token to piece cache size = 0.7999 MB
668 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: format = GGUF V3 (latest)
669 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: arch = llama
670 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: vocab type = BPE
671 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_vocab = 128256
672 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_merges = 280147
673 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: vocab_only = 0
674 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_ctx_train = 131072
675 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd = 8192
676 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_layer = 80
677 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_head = 64
678 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_head_kv = 8
679 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_rot = 128
680 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_swa = 0
681 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_k = 128
682 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_head_v = 128
683 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_gqa = 8
684 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_k_gqa = 1024
685 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_embd_v_gqa = 1024
686 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_norm_eps = 0.0e+00
687 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
688 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
689 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
690 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: f_logit_scale = 0.0e+00
691 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_ff = 28672
692 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_expert = 0
693 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_expert_used = 0
694 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: causal attn = 1
695 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: pooling type = 0
696 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: rope type = 0
697 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: rope scaling = linear
698 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: freq_base_train = 500000.0
699 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: freq_scale_train = 1
700 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: n_ctx_orig_yarn = 131072
701 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: rope_finetuned = unknown
702 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_d_conv = 0
703 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_d_inner = 0
704 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_d_state = 0
705 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_rank = 0
706 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: ssm_dt_b_c_rms = 0
707 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model type = 70B
708 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model ftype = Q4_0
709 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model params = 70.55 B
710 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW)
711 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
712 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
713 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
714 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: LF token = 128 'Ä'
715 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
716 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_print_meta: max token length = 256
717 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
718 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
719 Sep 25 13:57:37 ai-platform ollama[863]: ggml_cuda_init: found 1 CUDA devices:
720 Sep 25 13:57:37 ai-platform ollama[863]: Device 0: Tesla P40, compute capability 6.1, VMM: yes
721 Sep 25 13:57:37 ai-platform ollama[863]: llm_load_tensors: ggml ctx size = 0.68 MiB
722 Sep 25 13:57:38 ai-platform ollama[863]: time=2024-09-25T13:57:38.519-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
723 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: offloading 47 repeating layers to GPU
724 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: offloaded 47/81 layers to GPU
725 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: CPU buffer size = 38110.61 MiB
726 Sep 25 13:57:39 ai-platform ollama[863]: llm_load_tensors: CUDA0 buffer size = 21575.95 MiB
727 Sep 25 13:57:39 ai-platform ollama[863]: time=2024-09-25T13:57:39.221-07:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
728 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: n_ctx = 2048
729 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: n_batch = 512
730 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: n_ubatch = 512
731 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: flash_attn = 0
732 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: freq_base = 500000.0
733 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: freq_scale = 1
734 Sep 25 13:57:41 ai-platform ollama[863]: llama_kv_cache_init: CUDA_Host KV buffer size = 264.00 MiB
735 Sep 25 13:57:41 ai-platform ollama[863]: llama_kv_cache_init: CUDA0 KV buffer size = 376.00 MiB
736 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
737 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host output buffer size = 0.52 MiB
738 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB
739 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
740 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: graph nodes = 2566
741 Sep 25 13:57:41 ai-platform ollama[863]: llama_new_context_with_model: graph splits = 433
742 Sep 25 13:57:42 ai-platform ollama[2354]: INFO [main] model loaded | tid="139803936231424" timestamp=1727297862
743 Sep 25 13:57:42 ai-platform ollama[863]: time=2024-09-25T13:57:42.760-07:00 level=INFO source=server.go:626 msg="llama runner started in 5.95 seconds"
744 Sep 25 13:57:50 ai-platform ollama[863]: [GIN] 2024/09/25 - 13:57:50 | 200 | 14.16960451s | 127.0.0.1 | POST "/api/chat"
@dhiltgen commented on GitHub (Sep 25, 2024):
To clarify, I believe you're trying to load a larger model that needs to span the two GPUs, and we're failing to do so. Is that correct? If so, then I think I understand the problem.
The M6000 is Compute Capability 5.2, which requires the CUDA v11 runner. The P40 is 6.1, which can leverage v12. We probably have a bug where we're not falling back to the lowest-common-denominator CUDA library. To work around this, try setting OLLAMA_LLM_LIBRARY=cuda_v11 to force it to use that runner, and I believe it should then start working across the two GPUs.
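For a systemd-managed install, a drop-in override is the usual way to set the variable; a minimal sketch (unit name taken from the service shown in the logs above):

sudo systemctl edit ollama.service
# in the editor that opens, add:
[Service]
Environment="OLLAMA_LLM_LIBRARY=cuda_v11"
# then apply the change:
sudo systemctl daemon-reload
sudo systemctl restart ollama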
OLLAMA_LLM_LIBRARY=cuda_v11to force it to use that runner, and I believe it should start working on the 2 GPUs.@Blake110 commented on GitHub (Sep 26, 2024):
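For readers hitting the same thing, a minimal sketch of that workaround on a systemd-managed install (assuming the stock ollama.service unit; the unit name and paths may differ on your system):

sudo systemctl edit ollama.service

# in the drop-in file that opens, add:
[Service]
Environment="OLLAMA_LLM_LIBRARY=cuda_v11"

# then reload and restart so the variable takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama

The drop-in keeps the override separate from the packaged unit file, so a later upgrade won't clobber it.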
@Blake110 commented on GitHub (Sep 26, 2024):
@dhiltgen Thank you for your time and reply.
Yeah, I'm trying to run llama3.1:70b locally on Ubuntu 22.04.
I followed your guidance and added OLLAMA_LLM_LIBRARY=cuda_v11 to "ollama.service", but it doesn't work: Ollama is still running only on the P40 (compute 6.1).
I'm not sure whether I've done this correctly; below is my ollama.service configuration.
/etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
OLLAMA_LLM_LIBRARY=cuda_v11
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0" "PATH=/home/user/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/>
CUDA_VISIBLE_DEVICES=0,1
[Install]
WantedBy=default.target
I also tried adding
Environment="OLLAMA_LLM_LIBRARY=cuda_v11"
but that doesn't work either.
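For what it's worth, systemd ignores bare KEY=VALUE lines inside [Service] (they typically show up in the journal as "Unknown key name ... ignoring" warnings); environment variables have to be passed through Environment= directives. A corrected [Service] section based on the unit above might look like this (the existing truncated PATH line is omitted here and can stay as posted):

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_LLM_LIBRARY=cuda_v11"
Environment="CUDA_VISIBLE_DEVICES=0,1"

followed by sudo systemctl daemon-reload and sudo systemctl restart ollama so the change is actually applied.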
@taco-q commented on GitHub (Sep 26, 2024):
I have the same problem.
My GPUs are 3090 x2 and M6000 x1 running Ollama 0.3.12, but the M6000 is not used.
Attached is an excerpt from the server logs.
Sep 26 11:09:00 h11ssl-i systemd[1]: Started Ollama Service.
Sep 26 11:09:01 h11ssl-i ollama[25305]: 2024/09/26 11:09:01 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:cuda_v11 OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 26 11:09:01 h11ssl-i ollama[25305]: time=2024-09-26T11:09:01.004Z level=INFO source=images.go:753 msg="total blobs: 23"
Sep 26 11:09:01 h11ssl-i ollama[25305]: time=2024-09-26T11:09:01.005Z level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 26 11:09:01 h11ssl-i ollama[25305]: time=2024-09-26T11:09:01.006Z level=INFO source=routes.go:1200 msg="Listening on 127.0.0.1:11434 (version 0.3.12)"
Sep 26 11:09:01 h11ssl-i ollama[25305]: time=2024-09-26T11:09:01.006Z level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2432595780/runners
Sep 26 11:09:17 h11ssl-i ollama[25305]: time=2024-09-26T11:09:17.463Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v11 cuda_v12 rocm_v60102 cpu cpu_avx cpu_avx2]"
Sep 26 11:09:17 h11ssl-i ollama[25305]: time=2024-09-26T11:09:17.463Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
Sep 26 11:09:18 h11ssl-i ollama[25305]: time=2024-09-26T11:09:18.240Z level=INFO source=types.go:107 msg="inference compute" id=GPU-484f7983-dd05-2e00-78c1-bc181e698055 library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
Sep 26 11:09:18 h11ssl-i ollama[25305]: time=2024-09-26T11:09:18.240Z level=INFO source=types.go:107 msg="inference compute" id=GPU-c5f8f968-5653-9187-5bdb-c8931d908436 library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
Sep 26 11:09:18 h11ssl-i ollama[25305]: time=2024-09-26T11:09:18.240Z level=INFO source=types.go:107 msg="inference compute" id=GPU-1033f84d-2926-787c-21cd-765120fc5009 library=cuda variant=v11 compute=5.2 driver=12.6 name="Quadro M6000 24GB" total="23.9 GiB" available="23.8 GiB"
Sep 26 11:10:02 h11ssl-i ollama[25305]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 26 11:10:02 h11ssl-i ollama[25305]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 26 11:10:02 h11ssl-i ollama[25305]: ggml_cuda_init: found 2 CUDA devices:
Sep 26 11:10:02 h11ssl-i ollama[25305]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Sep 26 11:10:02 h11ssl-i ollama[25305]: Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Sep 26 11:10:02 h11ssl-i ollama[25305]: llm_load_tensors: ggml ctx size = 1.27 MiB
Sep 26 11:10:03 h11ssl-i ollama[25305]: time=2024-09-26T11:10:03.288Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
Sep 26 11:10:08 h11ssl-i ollama[25305]: time=2024-09-26T11:10:08.956Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Sep 26 11:10:09 h11ssl-i ollama[25305]: llm_load_tensors: offloading 67 repeating layers to GPU
Sep 26 11:10:09 h11ssl-i ollama[25305]: llm_load_tensors: offloaded 67/81 layers to GPU
Sep 26 11:10:09 h11ssl-i ollama[25305]: llm_load_tensors: CPU buffer size = 51919.44 MiB
Sep 26 11:10:09 h11ssl-i ollama[25305]: llm_load_tensors: CUDA0 buffer size = 20274.98 MiB
Sep 26 11:10:09 h11ssl-i ollama[25305]: llm_load_tensors: CUDA1 buffer size = 21377.71 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: n_ctx = 2048
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: n_batch = 512
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: n_ubatch = 512
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: flash_attn = 0
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: freq_base = 1000000.0
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: freq_scale = 1
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_kv_cache_init: CUDA_Host KV buffer size = 104.00 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_kv_cache_init: CUDA0 KV buffer size = 264.00 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_kv_cache_init: CUDA1 KV buffer size = 272.00 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: CUDA_Host output buffer size = 0.61 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: CUDA0 compute buffer size = 1287.53 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: CUDA1 compute buffer size = 324.00 MiB
Sep 26 11:10:14 h11ssl-i ollama[25305]: llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB
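Worth noting: the server config map at the top of this log shows OLLAMA_LLM_LIBRARY:cuda_v11 was picked up, yet ggml_cuda_init still reports only the two 3090s, so the M6000 is being skipped rather than the variable being lost. A quick way to pull the relevant lines when reproducing (assuming a systemd install logging to the journal; the patterns below match the log messages pasted above):

journalctl -u ollama --no-pager | grep -E "Dynamic LLM libraries|inference compute|found .* CUDA devices"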
@Blake110 commented on GitHub (Sep 27, 2024):
@taco-q Have you solved this issue? I can't even find the gpu.go file on my system...
@taco-q commented on GitHub (Sep 27, 2024):
Wait for progress on the following pull request:
https://github.com/ollama/ollama/pull/6983
@prusnak commented on GitHub (Feb 25, 2025):
Fixed with https://github.com/ollama/ollama/pull/8567
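For anyone landing here later: after upgrading (for example via the standard install script from the Ollama docs), the older compute 5.2 card should appear alongside the newer GPUs in the "inference compute" lines at startup. A quick check:

curl -fsSL https://ollama.com/install.sh | sh
ollama -v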