Mirror of https://github.com/ollama/ollama.git (synced 2026-05-07 08:30:05 -05:00)
Closed · opened 2026-04-28 20:36:09 -05:00 by GiteaMirror · 18 comments
Originally created by @axil76 on GitHub (Dec 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7919
What is the issue?
I am testing vGPU on a vSphere 8 cluster. The drivers work on the Red Hat 8 OS and in Docker. When the VM boots, the Ollama server responds well, but after several minutes it no longer responds.
Device 0: NVIDIA L40S-24C, compute capability 8.9, VMM: no
time=2024-12-03T14:30:07.963Z level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server loading model"
After that the service no longer responds, even though the nvidia-persistenced service is running.
I don't understand where the problem comes from; when the card was mounted directly on the VM, it worked.
In Docker:
nvidia-smi
Tue Dec 3 14:37:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S-24C Off | 00000000:02:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 12571MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
ollama version
ollama version is 0.4.7
Thanks for your answers.
OS
Docker
GPU
Nvidia
CPU
Intel
Ollama version
0.4.7
@rick-github commented on GitHub (Dec 3, 2024):
Adding full server logs will aid in debugging.
@axil76 commented on GitHub (Dec 3, 2024):
ollama.log
Example of a request: after I get the message msg="waiting for server to become available" status="llm server loading model", there is no response and I have to restart the container.
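(A minimal way to check whether the server is still responding is a curl request against the API; the model name below is just a placeholder:
curl http://localhost:11434/api/version
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "hello"}'
If /api/version answers but /api/generate hangs, the HTTP server is up and the model runner is the part that is stuck.)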
@rick-github commented on GitHub (Dec 3, 2024):
The server did become available after 36 seconds:
What client is accessing the model? If you add OLLAMA_DEBUG=1 to the server environment there might be something in the logs to indicate what is happening.
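(For the Docker install this means passing the variable to the container, for example, assuming the standard ollama/ollama image:
docker run -d --gpus=all -e OLLAMA_DEBUG=1 -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
and then reading the logs with docker logs -f ollama.)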
@axil76 commented on GitHub (Dec 3, 2024):
ollama.log
with OLLAMA_DEBUG=1
The log output is very long... it writes a message every 10 seconds.
Now it is Continue that is making the requests to Ollama.
What is strange is that just after boot it works very well for a few minutes, and then it stops working; the performance deteriorates.
@rick-github commented on GitHub (Dec 3, 2024):
What model are you using? The GET every 10 seconds is something outside of the container doing (presumably) a health check. The log finishes right after the model was ready and the prompt was being processed; was it restarted at that point, or did you leave off the end of the log because there was nothing interesting?
@AdminOfOz commented on GitHub (Dec 3, 2024):
I can somewhat anecdotally confirm that I have experienced a similar degradation of service, where the initial boot of Ollama works great, but after a few requests or after some time it does not work.
The only other log I'm getting is "gpu VRAM usage didn't recover within timeout".
SIDE NOTE: I recently went through a large infrastructure change that might invalidate my report.
My previous setup:
I was previously using Ollama in a Docker container with a 4090, and I believe it was the previous version of Ollama.
My current setup:
I took the same hardware and converted it so that I'm now using GPU passthrough in Proxmox. I also upgraded to the current version of Ollama (and am no longer running it via Docker) during this period, so I cannot say whether the change in performance was due to the version upgrade or to passing through the GPU; quite frankly, it was too much work to get GPU passthrough working.
Current nvidia-smi:
The process runs in a terminal, so not great logging... I know... I know. I did see this error:
"gpu VRAM usage didn't recover within timeout" seconds=6.*
INFO: 192.168.1.30:55030 - "POST /ollama/ap
@rick-github commented on GitHub (Dec 4, 2024):
Either there's no model loaded, or ollama is not using the GPU.
Ollama doesn't have any endpoints that start with "/ollama".
I know it's in a terminal, but logs would be required for any debugging.
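(For a server started manually in a terminal, the output can be captured to a file, e.g.:
ollama serve 2>&1 | tee ollama.log
and on a systemd-based Linux install, journalctl -u ollama is another way to collect the server log.)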
@axil76 commented on GitHub (Dec 4, 2024):
There was not much in the log. On the other hand, I use the NVIDIA GRID (vGPU) drivers matching the version installed on the ESXi host; I don't know whether the problem comes from that.
@rick-github commented on GitHub (Dec 12, 2024):
From #8023, it's possible the performance decline from the original post is a licensing issue. What's the output of nvidia-smi -q?
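(On a vGPU guest, nvidia-smi -q includes a vGPU licensing section, which can be filtered with, for example:
nvidia-smi -q | grep -i -A 2 license
An unlicensed vGPU is throttled after a grace period, which would match the "works for a few minutes, then degrades" pattern described above.)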
@clduab11 commented on GitHub (Dec 23, 2024):
For what it's worth...
I'm running CUDA version 12.7 (I think NVIDIA's site said my vGPU doesn't support the license system, though running nvidia-smi -q didn't even show any vGPU information for me).
I wanted to throw my hat in the ring and say I'm having very wonky inference times whereas in previous versions I did not, and wondered if this issue may be related. I'll do my best to provide full logs... I launch with docker-compose.yaml (but unfortunately don't have debug mode in my .yaml)...
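(Enabling debug in a compose file is just an environment entry on the ollama service; a minimal sketch, assuming the standard image:
services:
  ollama:
    image: ollama/ollama
    environment:
      - OLLAMA_DEBUG=1
)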
This is one such example of time to first token (over 5 minutes): the llama runner took a long time to start. I will fit all the logs I can, from first to last... I know my Pipelines setup throws an error, but it isn't related to the poor inference; this happens with or without Pipelines in my configuration.
2024-12-22 19:32:21 open-webui | {'model': 'hf.co/mradermacher/HomerCreativeAnvita-Mix-Qw7B-i1-GGUF:Q6_K', 'messages': [{'role': 'user', 'content': '### Task:\nYou are an autocompletion system. Continue the text inbased on the **completion type** inand the given language. \n\n### **Instructions**:\n1. Analyzefor context and meaning. \n2. Useto guide your output: \n - **General**: Provide a natural, concise continuation. \n - **Search Query**: Complete as if generating a realistic search query. \n3. Start as if you are directly continuing. Do **not** repeat, paraphrase, or respond as a model. Simply complete the text. \n4. Ensure the continuation:\n - Flows naturally from. \n - Avoids repetition, overexplaining, or unrelated ideas. \n5. If unsure, return:{ "text": "" }. \n\n### **Output Rules**:\n- Respond only in JSON format:{ "text": "<your_completion>" }.\n\n### **Examples**:\n#### Example 1: \nInput: \n<type>General</type> \n<text>The sun was setting over the horizon, painting the sky</text> \nOutput: \n{ "text": "with vibrant shades of orange and pink." }\n\n#### Example 2: \nInput: \n<type>Search Query</type> \n<text>Top-rated restaurants in</text> \nOutput: \n{ "text": "New York City for Italian cuisine." } \n\n---\n### Context:\n<chat_history>\n\n</chat_history>\n<type>search query</type> \n<text>Homer, talk to me about </text> \n#### Output:\n'}], 'stream': False, 'metadata': {'task': 'autocomplete_generation', 'task_body': {'model': 'hf.co/mradermacher/HomerCreativeAnvita-Mix-Qw7B-i1-GGUF:Q6_K', 'prompt': 'Homer, talk to me about ', 'type': 'search query', 'stream': False}, 'chat_id': None}}
2024-12-22 19:32:21 open-webui | INFO: 127.0.0.1:57710 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.161Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.083582445 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.411Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.333496096 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 28 --threads 8 --parallel 1 --port 33289"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=sched.go:449 msg="loaded runners" count=1
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.647Z level=INFO source=runner.go:945 msg="starting go runner"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.661Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.583968982 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6
2024-12-22 19:32:24 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2024-12-22 19:32:24 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-12-22 19:32:24 ollama | ggml_cuda_init: found 1 CUDA devices:
2024-12-22 19:32:24 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.680Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.680Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:33289"
2024-12-22 19:32:24 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.878Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
2024-12-22 19:32:24 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest))
2024-12-22 19:32:24 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 1: general.type str = model
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 6: general.size_label str = 7B
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"]
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318
2024-12-22 19:32:24 ollama | llama_model_loader: - type f32: 141 tensors
2024-12-22 19:32:24 ollama | llama_model_loader: - type q6_K: 198 tensors
2024-12-22 19:32:25 ollama | llm_load_vocab: special tokens cache size = 22
2024-12-22 19:32:25 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB
2024-12-22 19:32:25 ollama | llm_load_print_meta: format = GGUF V3 (latest)
2024-12-22 19:32:25 ollama | llm_load_print_meta: arch = qwen2
2024-12-22 19:32:25 ollama | llm_load_print_meta: vocab type = BPE
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_vocab = 152064
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_merges = 151387
2024-12-22 19:32:25 ollama | llm_load_print_meta: vocab_only = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ctx_train = 32768
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd = 3584
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_layer = 28
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_head = 28
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_head_kv = 4
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_rot = 128
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_swa = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_head_k = 128
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_head_v = 128
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_gqa = 7
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_k_gqa = 512
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_v_gqa = 512
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ff = 18944
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_expert = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_expert_used = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: causal attn = 1
2024-12-22 19:32:25 ollama | llm_load_print_meta: pooling type = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: rope type = 2
2024-12-22 19:32:25 ollama | llm_load_print_meta: rope scaling = linear
2024-12-22 19:32:25 ollama | llm_load_print_meta: freq_base_train = 1000000.0
2024-12-22 19:32:25 ollama | llm_load_print_meta: freq_scale_train = 1
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768
2024-12-22 19:32:25 ollama | llm_load_print_meta: rope_finetuned = unknown
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_conv = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_inner = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_state = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_dt_rank = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: model type = 7B
2024-12-22 19:32:25 ollama | llm_load_print_meta: model ftype = Q6_K
2024-12-22 19:32:25 ollama | llm_load_print_meta: model params = 7.62 B
2024-12-22 19:32:25 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW)
2024-12-22 19:32:25 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:32:25 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: max token length = 256
2024-12-22 19:32:25 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/new HTTP/1.1" 200 OK
2024-12-22 19:32:25 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK
2024-12-22 19:32:25 open-webui | INFO [open_webui.apps.openai.main] get_all_models()
2024-12-22 19:32:25 pipelines | INFO: 172.18.0.1:56674 - "GET /models HTTP/1.1" 200 OK
2024-12-22 19:32:26 open-webui | INFO [open_webui.apps.ollama.main] get_all_models()
2024-12-22 19:32:26 ollama | [GIN] 2024/12/23 - 01:32:26 | 200 | 77.050597ms | 172.18.0.5 | GET "/api/tags"
2024-12-22 19:32:27 pipelines | pipe:blueprints.function_calling_blueprint
2024-12-22 19:32:27 pipelines | {'id': '9b978aa8-4155-47c3-9e9a-4dac714d9078', 'email': 'chrisldukes@gmail.com', 'name': 'Chris Dukes', 'role': 'admin'}
2024-12-22 19:32:27 pipelines | Error: 400 Client Error: Bad Request for url: https://api.openai.com/v1/chat/completions
2024-12-22 19:32:27 pipelines | INFO: 172.18.0.1:37230 - "POST /function_calling_scaffold/filter/inlet HTTP/1.1" 200 OK
2024-12-22 19:32:28 open-webui | INFO [open_webui.apps.ollama.main] url: http://ollama:11434
2024-12-22 19:32:51 open-webui | Fetching models from https://api.mistral.ai/v1/models
2024-12-22 19:32:51 open-webui | INFO: 127.0.0.1:42742 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:33:21 open-webui | INFO: 127.0.0.1:36904 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:33:51 open-webui | INFO: 127.0.0.1:58296 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:34:21 open-webui | INFO: 127.0.0.1:60204 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:34:51 open-webui | INFO: 127.0.0.1:43350 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:35:11 ollama | llm_load_tensors: offloading 28 repeating layers to GPU
2024-12-22 19:35:11 ollama | llm_load_tensors: offloaded 28/29 layers to GPU
2024-12-22 19:35:11 ollama | llm_load_tensors: CPU_Mapped model buffer size = 852.73 MiB
2024-12-22 19:35:11 ollama | llm_load_tensors: CUDA0 model buffer size = 5106.06 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_seq_max = 1
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx = 8192
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx_per_seq = 8192
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_batch = 512
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ubatch = 512
2024-12-22 19:35:12 ollama | llama_new_context_with_model: flash_attn = 0
2024-12-22 19:35:12 ollama | llama_new_context_with_model: freq_base = 1000000.0
2024-12-22 19:35:12 ollama | llama_new_context_with_model: freq_scale = 1
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
2024-12-22 19:35:12 ollama | llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: CPU output buffer size = 0.59 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 23.01 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: graph nodes = 986
2024-12-22 19:35:12 ollama | llama_new_context_with_model: graph splits = 4 (with bs=512), 3 (with bs=1)
2024-12-22 19:35:13 ollama | time=2024-12-23T01:35:13.002Z level=INFO source=server.go:594 msg="llama runner started in 168.39 seconds"
2024-12-22 19:35:15 ollama | [GIN] 2024/12/23 - 01:35:15 | 200 | 2m56s | 172.18.0.5 | POST "/api/chat"
2024-12-22 19:35:15 open-webui | INFO: 172.18.0.1:57508 - "POST /api/task/auto/completions HTTP/1.1" 200 OK
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.469Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.4100184989999995 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.743Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.744Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=145 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.747Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 39149"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=sched.go:449 msg="loaded runners" count=1
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.904Z level=INFO source=runner.go:945 msg="starting go runner"
2024-12-22 19:35:20 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2024-12-22 19:35:20 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-12-22 19:35:20 ollama | ggml_cuda_init: found 1 CUDA devices:
2024-12-22 19:35:20 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.976Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=128
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.976Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:39149"
2024-12-22 19:35:21 ollama | time=2024-12-23T01:35:21.000Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
2024-12-22 19:35:21 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free
2024-12-22 19:35:21 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest))
2024-12-22 19:35:21 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 1: general.type str = model
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 6: general.size_label str = 7B
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"]
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H...
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom...
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/...
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318
2024-12-22 19:35:21 ollama | llama_model_loader: - type f32: 141 tensors
2024-12-22 19:35:21 ollama | llama_model_loader: - type q6_K: 198 tensors
2024-12-22 19:35:21 open-webui | INFO: 127.0.0.1:55554 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:35:21 ollama | llm_load_vocab: special tokens cache size = 22
2024-12-22 19:35:21 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB
2024-12-22 19:35:21 ollama | llm_load_print_meta: format = GGUF V3 (latest)
2024-12-22 19:35:21 ollama | llm_load_print_meta: arch = qwen2
2024-12-22 19:35:21 ollama | llm_load_print_meta: vocab type = BPE
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_vocab = 152064
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_merges = 151387
2024-12-22 19:35:21 ollama | llm_load_print_meta: vocab_only = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ctx_train = 32768
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd = 3584
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_layer = 28
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_head = 28
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_head_kv = 4
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_rot = 128
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_swa = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_head_k = 128
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_head_v = 128
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_gqa = 7
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_k_gqa = 512
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_v_gqa = 512
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ff = 18944
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_expert = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_expert_used = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: causal attn = 1
2024-12-22 19:35:21 ollama | llm_load_print_meta: pooling type = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: rope type = 2
2024-12-22 19:35:21 ollama | llm_load_print_meta: rope scaling = linear
2024-12-22 19:35:21 ollama | llm_load_print_meta: freq_base_train = 1000000.0
2024-12-22 19:35:21 ollama | llm_load_print_meta: freq_scale_train = 1
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768
2024-12-22 19:35:21 ollama | llm_load_print_meta: rope_finetuned = unknown
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_conv = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_inner = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_state = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_dt_rank = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: model type = 7B
2024-12-22 19:35:21 ollama | llm_load_print_meta: model ftype = Q6_K
2024-12-22 19:35:21 ollama | llm_load_print_meta: model params = 7.62 B
2024-12-22 19:35:21 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW)
2024-12-22 19:35:21 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:35:21 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: max token length = 256
2024-12-22 19:35:51 open-webui | INFO: 127.0.0.1:36152 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:36:21 open-webui | INFO: 127.0.0.1:58318 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:36:51 open-webui | INFO: 127.0.0.1:46402 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:37:21 open-webui | INFO: 127.0.0.1:39014 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:37:22 ollama | llm_load_tensors: offloading 28 repeating layers to GPU
2024-12-22 19:37:22 ollama | llm_load_tensors: offloading output layer to GPU
2024-12-22 19:37:22 ollama | llm_load_tensors: offloaded 29/29 layers to GPU
2024-12-22 19:37:22 ollama | llm_load_tensors: CPU_Mapped model buffer size = 426.36 MiB
2024-12-22 19:37:22 ollama | llm_load_tensors: CUDA0 model buffer size = 5532.43 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_seq_max = 1
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx = 8192
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx_per_seq = 8192
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_batch = 512
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ubatch = 512
2024-12-22 19:37:24 ollama | llama_new_context_with_model: flash_attn = 0
2024-12-22 19:37:24 ollama | llama_new_context_with_model: freq_base = 1000000.0
2024-12-22 19:37:24 ollama | llama_new_context_with_model: freq_scale = 1
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
2024-12-22 19:37:24 ollama | llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA_Host output buffer size = 0.59 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA0 compute buffer size = 492.00 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 23.01 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: graph nodes = 986
2024-12-22 19:37:24 ollama | llama_new_context_with_model: graph splits = 2
2024-12-22 19:37:24 ollama | time=2024-12-23T01:37:24.223Z level=INFO source=server.go:594 msg="llama runner started in 123.48 seconds"
2024-12-22 19:37:24 open-webui | INFO: 172.18.0.1:57524 - "POST /ollama/api/chat HTTP/1.1" 200 OK
2024-12-22 19:37:31 ollama | [GIN] 2024/12/23 - 01:37:31 | 200 | 5m3s | 172.18.0.5 | POST "/api/chat"
2024-12-22 19:37:31 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/bfc9c299-7266-4e71-b434-ed75b9ee3c5a HTTP/1.1" 200 OK
2024-12-22 19:37:31 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK
2024-12-22 19:37:31 open-webui | INFO [open_webui.apps.openai.main] get_all_models()
2024-12-22 19:37:31 pipelines | INFO: 172.18.0.1:49478 - "GET /models HTTP/1.1" 200 OK
2024-12-22 19:37:32 open-webui | INFO [open_webui.apps.ollama.main] get_all_models()
2024-12-22 19:37:32 ollama | [GIN] 2024/12/23 - 01:37:32 | 200 | 63.627221ms | 172.18.0.5 | GET "/api/tags"
2024-12-22 19:37:34 pipelines | INFO: 172.18.0.1:49494 - "POST /function_calling_scaffold/filter/outlet HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | Fetching models from https://api.mistral.ai/v1/models
2024-12-22 19:37:34 open-webui | <Encoding 'o200k_base'>
2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "POST /api/chat/completed HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/bfc9c299-7266-4e71-b434-ed75b9ee3c5a HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK
2024-12-22 19:37:34 pipelines | pipe:blueprints.function_calling_blueprint
2024-12-22 19:37:34 pipelines | {'id': '9b978aa8-4155-47c3-9e9a-4dac714d9078', 'email': 'chrisldukes@gmail.com', 'name': 'Chris Dukes', 'role': 'admin'}
2024-12-22 19:37:34 pipelines | Error: 400 Client Error: Bad Request for url: https://api.openai.com/v1/chat/completions
2024-12-22 19:37:34 pipelines | INFO: 172.18.0.1:49500 - "POST /function_calling_scaffold/filter/inlet HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | INFO [open_webui.apps.ollama.main] url: http://ollama:11434
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.076Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.190654251 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.326Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.440366167 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.576Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.690480729 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.580Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.580Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.584Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 28 --threads 8 --parallel 1 --port 46047"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=sched.go:449 msg="loaded runners" count=1
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.749Z level=INFO source=runner.go:945 msg="starting go runner"
2024-12-22 19:37:40 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2024-12-22 19:37:40 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-12-22 19:37:40 ollama | ggml_cuda_init: found 1 CUDA devices:
2024-12-22 19:37:40 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.823Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.823Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:46047"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.837Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
2024-12-22 19:37:40 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free
2024-12-22 19:37:41 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest))
2024-12-22 19:37:41 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 1: general.type str = model
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 6: general.size_label str = 7B
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"]
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318 2024-12-22 19:37:41 ollama | llama_model_loader: - type f32: 141 tensors 2024-12-22 19:37:41 ollama | llama_model_loader: - type q6_K: 198 tensors 2024-12-22 19:37:41 ollama | llm_load_vocab: special tokens cache size = 22 2024-12-22 19:37:41 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB 2024-12-22 19:37:41 ollama | llm_load_print_meta: format = GGUF V3 (latest) 2024-12-22 19:37:41 ollama | llm_load_print_meta: arch = qwen2 2024-12-22 19:37:41 ollama | llm_load_print_meta: vocab type = BPE 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_vocab = 152064 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_merges = 151387 2024-12-22 19:37:41 ollama | llm_load_print_meta: vocab_only = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ctx_train = 32768 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd = 3584 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_layer = 28 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_head = 28 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_head_kv = 4 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_rot = 128 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_swa = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_head_k = 128 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_head_v = 128 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_gqa = 7 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_k_gqa = 512 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_v_gqa = 512 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ff = 18944 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_expert = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_expert_used = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: causal attn = 1 2024-12-22 19:37:41 ollama | llm_load_print_meta: pooling type = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: rope type = 2 2024-12-22 19:37:41 
ollama | llm_load_print_meta: rope scaling = linear 2024-12-22 19:37:41 ollama | llm_load_print_meta: freq_base_train = 1000000.0 2024-12-22 19:37:41 ollama | llm_load_print_meta: freq_scale_train = 1 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768 2024-12-22 19:37:41 ollama | llm_load_print_meta: rope_finetuned = unknown 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_conv = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_inner = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_state = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_dt_rank = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: model type = 7B 2024-12-22 19:37:41 ollama | llm_load_print_meta: model ftype = Q6_K 2024-12-22 19:37:41 ollama | llm_load_print_meta: model params = 7.62 B 2024-12-22 19:37:41 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW) 2024-12-22 19:37:41 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:37:41 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: PAD token = 151643 '<|endoftext|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: max token length = 256Parts of logs I feel could be relevant to my noob eyes?
```
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.469Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.4100184989999995 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
```

I also thought somewhere I saw the llama runner take an inordinate amount of time?
I'm not sure if any of this is helpful or not, but I've been racking my brains trying to figure it out. As I said at the top, I run my configuration through a .yaml in Docker Compose. I can upload the .yaml if it's helpful.
@rick-github commented on GitHub (Dec 23, 2024):
Could you either add markdown block markers around the logs (```) or add the logs as an attachment? It's very difficult to parse the logs as they are.
@rick-github commented on GitHub (Dec 23, 2024):
Actually, I see that there is some sort of block; it seems that the text inside is badly formatted. Adding the logs as an attachment would help a lot.
@rick-github commented on GitHub (Dec 23, 2024):
From your screenshot, the model generated a very respectable 36 tokens per second, but the overall response took just over 5 minutes. The logs will show for sure, but it looks like most of that time was spent loading the model. If you haven't set `OLLAMA_KEEP_ALIVE` in your docker compose file, then ollama will unload a model after 5 minutes of inactivity. This may lead to the "wonky inference times" you mention: the first inference takes 5 minutes because the model needs to load, a second inference takes 7 seconds, you leave it for 10 minutes, and the third inference takes 5 minutes because the model has been evicted.
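A minimal sketch of what that could look like in a compose file (the service name, image tag, and the 24h value here are just examples; `OLLAMA_KEEP_ALIVE` accepts Go-style duration strings, and a negative value keeps models loaded indefinitely):

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    environment:
      # Keep loaded models resident for 24h instead of the 5m default;
      # a negative value (e.g. "-1") would keep them loaded indefinitely.
      - OLLAMA_KEEP_ALIVE=24h
```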
@clduab11 commented on GitHub (Dec 23, 2024):

Thanks so much for the response @rick-github! My apologies; I'm definitely still very new to GitHub, so I'll try to make this easier...
First of all, I'm sure this doesn't have a lot to do with it... but my Watchtower is included in my .yaml, and this is the current Ollama version I have...
Otherwise, you're absolutely correct; it's definitely the model loading that's taking the longest. I wish I had logs from the older versions, but my initial model load never used to be very long on the 0.4.x version(s).
I've done some testing this AM and this is what I've noticed...
This is the data for the first inference, indicative of model load. However, when going to prompt my model a second time (immediately after the first output had fully generated)... these were my results...
I noticed while watching my logs in Docker that it almost appeared to be, for lack of a better description, re-inferencing: it went through similar mechanisms twice before generating the second follow-up output (the screenshot directly above).
Here are some .txt's of my logs... One shows the logs between the 2nd input in -> 2nd output out, and one shows the full logs from the moment the first output generated -> 2nd output out (if that makes sense; so sorry if that's poorly phrased!)
second-inf-logs.txt
first-inference-to-end-of-2nd.txt
@rick-github commented on GitHub (Dec 23, 2024):
It's not re-inferencing; open-webui has a couple of features which result in multiple calls to the LLM API for a single inference. The first is summary generation, where open-webui takes the first response in a session and asks the LLM to summarize it so that it can add it to the chat list in the left-hand panel. The second is autocomplete, where open-webui takes the text you've typed in and asks the LLM to guess what you are going to type, to autocomplete the prompt.
I think these are playing into the delays you are seeing, because the size of the context window keeps changing:
I think what's happening is that you have a context window of 8192 configured somewhere (in the model with `PARAMETER num_ctx`, or in open-webui somewhere), and open-webui uses that for a completion. Then, when it does its secondary completion (summary or autocomplete or some other "helper" function), it uses the default context window (either explicitly with `"options":{"num_ctx":2048}` or implicitly by not setting `num_ctx`). Unfortunately, a change in context window results in a model eviction and immediate reload, which could cause the delays you are seeing: the actual completion finishes in seconds, but all the model unloading/loading around it makes it seem slow. I think you will have to poke around in the open-webui settings and either turn off these functions or configure them to use the same context window as the primary completion. There is an open PR which would alleviate this problem, but it's not ready for integration yet.
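For reference, pinning the context window at the model level is a two-line Modelfile change — a minimal sketch, where the base model in the `FROM` line is just a placeholder:

```
# Example Modelfile: the FROM line is a placeholder; 8192 matches the primary completion above
FROM qwen2.5:7b
PARAMETER num_ctx 8192
```

Build it with `ollama create <name> -f Modelfile`. Clients that explicitly pass a different `num_ctx` in `options` would still trigger an eviction and reload.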
@clduab11 commented on GitHub (Dec 24, 2024):

Oh wow, and to think I had seen that earlier this morning and was like "hmm, that's odd, I wonder why my num_ctx is set at 2048 for that..." and figured the OWUI interface had just "overridden" it somehow. This makes perfect sense, and I super appreciate you going out of your way to help me with this! I will reach out to the folks on OWUI's end and see where I should be configuring this to help alleviate some of the delay.
Thank you so so much! Ollama rocks!! :)
EDIT: Setting the num_ctx at the model level (instead of at the system level) and disabling the Autogeneration feature in OWUI brought my initial load down by a substantial margin, and further conversation-style prompts to the model now load as they should; woo! :)
Will eagerly await the next awesome update and the PR to be able to use the AutoComplete feature again without it evicting the model!
@tne-ops commented on GitHub (Jan 31, 2025):
Just a little confused: the issue in ollama is closed, but the ollama PR is still open after a month? It doesn't seem fixed, and it causes very slow performance with open webui :-)
@rick-github commented on GitHub (Jan 31, 2025):
The performance decline was likely due to licensing issues, but the OP didn't respond to a request for more information, so this issue was closed as stale. The unrelated issue from a different poster would be resolved by the PR, but the ollama team are busy with other things. Feel free to open a new issue to highlight the need for the PR to be merged.