Mirror of https://github.com/ollama/ollama.git
Closed · opened 2026-04-22 09:50:50 -05:00 by GiteaMirror · 20 comments
Originally created by @goactiongo on GitHub (Oct 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7146
Originally assigned to: @dhiltgen on GitHub.
What is the issue?
I have 4 GPU cards, each with 24 GB of VRAM.

It works for short text content, but fails on long text content. The model is qwen2.5:32b, and ctx-size is set to 30001 to handle the long content, as follows:
10月 09 12:37:41 gpu ollama[40766]: time=2024-10-09T12:37:41.871+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2735556946/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 --ctx-size 30001 --batch-size 512 --embedding --log-disable --n-gpu-layers 65 --parallel 1 --tensor-split 17,16,16,16 --port 41781"
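For reference, the --ctx-size 30001 on this command line comes from the client-side num_ctx option (the FastGPT request body quoted later in this issue carries the same field). A minimal Python sketch of setting it per request against Ollama's documented /api/chat endpoint; the prompt text here is only a stand-in:

import json
import urllib.request

# num_ctx sets the context window and, with --parallel 1, becomes the
# --ctx-size value on the runner command line shown above.
payload = {
    "model": "qwen2.5:32b",
    "messages": [{"role": "user", "content": "Summarize the document content"}],
    "options": {"num_ctx": 30001},
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["message"]["content"])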
AI debug information follows.
ollama debug logs
10月 09 12:37:39 gpu ollama[40766]: time=2024-10-09T12:37:39.452+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx2 cuda_v11 cuda_v12 rocm_v60102 cpu cpu_avx]"
10月 09 12:37:39 gpu ollama[40766]: time=2024-10-09T12:37:39.453+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
10月 09 12:37:40 gpu ollama[40766]: time=2024-10-09T12:37:40.723+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="14.7 GiB"
10月 09 12:37:40 gpu ollama[40766]: time=2024-10-09T12:37:40.723+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB"
10月 09 12:37:40 gpu ollama[40766]: time=2024-10-09T12:37:40.723+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="19.4 GiB"
10月 09 12:37:40 gpu ollama[40766]: time=2024-10-09T12:37:40.723+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="20.0 GiB"
10月 09 12:37:41 gpu ollama[40766]: time=2024-10-09T12:37:41.854+08:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 library=cuda parallel=1 required="40.6 GiB"
10月 09 12:37:41 gpu ollama[40766]: time=2024-10-09T12:37:41.854+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="112.7 GiB" free_swap="3.7 GiB"
10月 09 12:37:41 gpu ollama[40766]: time=2024-10-09T12:37:41.858+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=65 layers.split=17,16,16,16 memory.available="[20.0 GiB 19.4 GiB 16.8 GiB 14.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="40.6 GiB" memory.required.partial="40.6 GiB" memory.required.kv="7.3 GiB" memory.required.allocations="[10.6 GiB 10.0 GiB 10.0 GiB 10.0 GiB]" memory.weights.total="24.8 GiB" memory.weights.repeating="24.2 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="2.9 GiB" memory.graph.partial="2.9 GiB"
10月 09 12:37:41 gpu ollama[40766]: time=2024-10-09T12:37:41.871+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2735556946/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 --ctx-size 30001 --batch-size 512 --embedding --log-disable --n-gpu-layers 65 --parallel 1 --tensor-split 17,16,16,16 --port 41781"
10月 09 12:37:41 gpu ollama[40766]: time=2024-10-09T12:37:41.872+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
10月 09 12:37:41 gpu ollama[40766]: time=2024-10-09T12:37:41.872+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
10月 09 12:37:41 gpu ollama[40766]: time=2024-10-09T12:37:41.873+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
10月 09 12:37:41 gpu ollama[40766]: INFO [main] build info | build=10 commit="9225b05" tid="140652491755520" timestamp=1728448661
10月 09 12:37:41 gpu ollama[40766]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140652491755520" timestamp=1728448661 total_threads=64
10月 09 12:37:41 gpu ollama[40766]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="41781" tid="140652491755520" timestamp=1728448661
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: loaded meta data with 34 key-value pairs and 771 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 (version GGUF V3 (latest))
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 0: general.architecture str = qwen2
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 1: general.type str = model
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 2: general.name str = Qwen2.5 32B Instruct
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 3: general.finetune str = Instruct
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 4: general.basename str = Qwen2.5
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 5: general.size_label str = 32B
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 6: general.license str = apache-2.0
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-3...
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 8: general.base_model.count u32 = 1
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 32B
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-32B
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 12: general.tags arr[str,2] = ["chat", "text-generation"]
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 14: qwen2.block_count u32 = 64
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 15: qwen2.context_length u32 = 32768
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 16: qwen2.embedding_length u32 = 5120
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 27648
10月 09 12:37:41 gpu ollama[40766]: llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 40
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 8
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 22: general.file_type u32 = 15
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
10月 09 12:37:42 gpu ollama[40766]: time=2024-10-09T12:37:42.126+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - kv 33: general.quantization_version u32 = 2
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - type f32: 321 tensors
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - type q4_K: 385 tensors
10月 09 12:37:42 gpu ollama[40766]: llama_model_loader: - type q6_K: 65 tensors
10月 09 12:37:42 gpu ollama[40766]: llm_load_vocab: special tokens cache size = 22
10月 09 12:37:42 gpu ollama[40766]: llm_load_vocab: token to piece cache size = 0.9310 MB
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: format = GGUF V3 (latest)
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: arch = qwen2
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: vocab type = BPE
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_vocab = 152064
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_merges = 151387
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: vocab_only = 0
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_ctx_train = 32768
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_embd = 5120
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_layer = 64
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_head = 40
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_head_kv = 8
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_rot = 128
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_swa = 0
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_embd_head_k = 128
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_embd_head_v = 128
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_gqa = 5
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_embd_k_gqa = 1024
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_embd_v_gqa = 1024
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: f_norm_eps = 0.0e+00
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: f_logit_scale = 0.0e+00
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_ff = 27648
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_expert = 0
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_expert_used = 0
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: causal attn = 1
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: pooling type = 0
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: rope type = 2
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: rope scaling = linear
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: freq_base_train = 1000000.0
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: freq_scale_train = 1
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: n_ctx_orig_yarn = 32768
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: rope_finetuned = unknown
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: ssm_d_conv = 0
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: ssm_d_inner = 0
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: ssm_d_state = 0
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: ssm_dt_rank = 0
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: ssm_dt_b_c_rms = 0
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: model type = ?B
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: model ftype = Q4_K - Medium
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: model params = 32.76 B
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: model size = 18.48 GiB (4.85 BPW)
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: general.name = Qwen2.5 32B Instruct
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: EOS token = 151645 '<|im_end|>'
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: LF token = 148848 'ÄĬ'
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: EOT token = 151645 '<|im_end|>'
10月 09 12:37:42 gpu ollama[40766]: llm_load_print_meta: max token length = 256
10月 09 12:37:42 gpu ollama[40766]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
10月 09 12:37:42 gpu ollama[40766]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
10月 09 12:37:42 gpu ollama[40766]: ggml_cuda_init: found 4 CUDA devices:
10月 09 12:37:42 gpu ollama[40766]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
10月 09 12:37:42 gpu ollama[40766]: Device 1: NVIDIA A30, compute capability 8.0, VMM: yes
10月 09 12:37:42 gpu ollama[40766]: Device 2: NVIDIA A30, compute capability 8.0, VMM: yes
10月 09 12:37:42 gpu ollama[40766]: Device 3: NVIDIA A30, compute capability 8.0, VMM: yes
10月 09 12:37:42 gpu ollama[40766]: llm_load_tensors: ggml ctx size = 1.69 MiB
10月 09 12:37:43 gpu ollama[40766]: time=2024-10-09T12:37:43.583+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
10月 09 12:37:45 gpu ollama[40766]: time=2024-10-09T12:37:45.233+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
10月 09 12:37:46 gpu ollama[40766]: llm_load_tensors: offloading 64 repeating layers to GPU
10月 09 12:37:46 gpu ollama[40766]: llm_load_tensors: offloading non-repeating layers to GPU
10月 09 12:37:46 gpu ollama[40766]: llm_load_tensors: offloaded 65/65 layers to GPU
10月 09 12:37:46 gpu ollama[40766]: llm_load_tensors: CPU buffer size = 417.66 MiB
10月 09 12:37:46 gpu ollama[40766]: llm_load_tensors: CUDA0 buffer size = 4844.72 MiB
10月 09 12:37:46 gpu ollama[40766]: llm_load_tensors: CUDA1 buffer size = 4366.53 MiB
10月 09 12:37:46 gpu ollama[40766]: llm_load_tensors: CUDA2 buffer size = 4366.53 MiB
10月 09 12:37:46 gpu ollama[40766]: llm_load_tensors: CUDA3 buffer size = 4930.57 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: n_ctx = 30016
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: n_batch = 512
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: n_ubatch = 512
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: flash_attn = 0
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: freq_base = 1000000.0
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: freq_scale = 1
10月 09 12:37:49 gpu ollama[40766]: llama_kv_cache_init: CUDA0 KV buffer size = 1993.25 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_kv_cache_init: CUDA1 KV buffer size = 1876.00 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_kv_cache_init: CUDA2 KV buffer size = 1876.00 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_kv_cache_init: CUDA3 KV buffer size = 1758.75 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: KV self size = 7504.00 MiB, K (f16): 3752.00 MiB, V (f16): 3752.00 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: CUDA_Host output buffer size = 0.60 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: CUDA0 compute buffer size = 2659.51 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: CUDA1 compute buffer size = 2659.51 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: CUDA2 compute buffer size = 2659.51 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: CUDA3 compute buffer size = 2659.52 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: CUDA_Host compute buffer size = 244.52 MiB
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: graph nodes = 2246
10月 09 12:37:49 gpu ollama[40766]: llama_new_context_with_model: graph splits = 5
10月 09 12:37:49 gpu ollama[40766]: INFO [main] model loaded | tid="140652491755520" timestamp=1728448669
10月 09 12:37:49 gpu ollama[40766]: time=2024-10-09T12:37:49.646+08:00 level=INFO source=server.go:626 msg="llama runner started in 7.77 seconds"
10月 09 12:37:51 gpu ollama[40766]: [GIN] 2024/10/09 - 12:37:51 | 200 | 10.947908166s | 172.22.1.39 | POST "/api/chat"
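The KV-cache figure in this log follows directly from the model metadata printed above. A worked check in Python, assuming f16 K/V (2 bytes per element) as the log reports:

# All values are read from the llm_load_print_meta lines above.
n_layer = 64          # qwen2.block_count
n_embd_k_gqa = 1024   # equals n_embd_v_gqa for this model
n_ctx = 30016         # --ctx-size 30001, rounded up by the runner
f16_bytes = 2

kv_mib = 2 * n_layer * n_ctx * n_embd_k_gqa * f16_bytes / 2**20  # K plus V
print(kv_mib)  # 7504.0, matching "KV self size = 7504.00 MiB"

This is why the scheduler reports required="40.6 GiB" for a model that is 18.48 GiB on disk: the weights (24.8 GiB unpacked across GPUs per the log) plus this 7.3 GiB KV cache plus per-GPU compute buffers all have to fit in VRAM at a 30K context.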
OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: 0.3.11
@goactiongo commented on GitHub (Oct 9, 2024):
When the model is changed to glm4:9b (--ctx-size 128001), there is no output at all.
AI debug logs
[Info] 2024-10-09 05:08:00 [Vector Queue] Done
[Info] 2024-10-09 05:08:00 [QA Queue] Done
[Warn] 2024-10-09 05:08:16 LLM response error {"requestBody":{"num_ctx":128001,"model":"glm4:9b","temperature":0.01,"max_tokens":200,"stream":true,"messages":[{"role":"system","content":"Consider the content within as your knowledge\n\nFile: 600007.pdf\n\n...........HERE, 20,000 CHARACTERS IN 600007.pdf HAVE BEEN OMITTED........ \n\n\n"},{"role":"user","content":"Summarize the document content"}]}}
[Error] 2024-10-09 05:08:16 sse error: LLM model response empty
[Warn] 2024-10-09 05:08:16 Request finish /api/core/chat/chatTest, time: 130054ms
[Info] 2024-10-09 05:09:00 [QA Queue] Done
[Info] 2024-10-09 05:09:00 [Vector Queue] Done
ollama debug logs
10月 09 13:15:52 gpu ollama[42508]: [GIN] 2024/10/09 - 13:15:52 | 200 | 4m6s | 172.22.1.39 | POST "/api/chat"
10月 09 13:15:55 gpu ollama[42508]: time=2024-10-09T13:15:55.911+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 gpu=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 parallel=1 available=21459828736 required="18.4 GiB"
10月 09 13:15:55 gpu ollama[42508]: time=2024-10-09T13:15:55.911+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="112.6 GiB" free_swap="3.7 GiB"
10月 09 13:15:55 gpu ollama[42508]: time=2024-10-09T13:15:55.911+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[20.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.4 GiB" memory.required.partial="18.4 GiB" memory.required.kv="4.9 GiB" memory.required.allocations="[18.4 GiB]" memory.weights.total="9.2 GiB" memory.weights.repeating="8.7 GiB" memory.weights.nonrepeating="485.6 MiB" memory.graph.full="8.1 GiB" memory.graph.partial="8.2 GiB"
10月 09 13:15:55 gpu ollama[42508]: time=2024-10-09T13:15:55.922+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama3332959923/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 --ctx-size 128001 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 1 --port 43492"
10月 09 13:15:55 gpu ollama[42508]: time=2024-10-09T13:15:55.923+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
10月 09 13:15:55 gpu ollama[42508]: time=2024-10-09T13:15:55.923+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
10月 09 13:15:55 gpu ollama[42508]: time=2024-10-09T13:15:55.924+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
10月 09 13:15:55 gpu ollama[42508]: INFO [main] build info | build=10 commit="9225b05" tid="140519079464960" timestamp=1728450955
10月 09 13:15:55 gpu ollama[42508]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140519079464960" timestamp=1728450955 total_threads=64
10月 09 13:15:55 gpu ollama[42508]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="43492" tid="140519079464960" timestamp=1728450955
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: loaded meta data with 24 key-value pairs and 283 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 (version GGUF V3 (latest))
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 0: general.architecture str = chatglm
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 1: general.name str = glm-4-9b-chat
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 2: chatglm.context_length u32 = 131072
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 3: chatglm.embedding_length u32 = 4096
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 4: chatglm.feed_forward_length u32 = 13696
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 5: chatglm.block_count u32 = 40
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 6: chatglm.attention.head_count u32 = 32
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 7: chatglm.attention.head_count_kv u32 = 2
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 8: chatglm.attention.layer_norm_rms_epsilon f32 = 0.000000
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 9: general.file_type u32 = 2
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 10: chatglm.rope.dimension_count u32 = 64
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 12: chatglm.rope.freq_base f32 = 5000000.000000
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 14: tokenizer.ggml.pre str = chatglm-bpe
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,151552] = ["!", """, "#", "$", "%", "&", "'", ...
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
10月 09 13:15:56 gpu ollama[42508]: time=2024-10-09T13:15:56.177+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,151073] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 151329
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 151329
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 20: tokenizer.ggml.eot_token_id u32 = 151336
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 151329
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 22: tokenizer.chat_template str = [gMASK]{% for item in messages %...
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - kv 23: general.quantization_version u32 = 2
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - type f32: 121 tensors
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - type q4_0: 161 tensors
10月 09 13:15:56 gpu ollama[42508]: llama_model_loader: - type q6_K: 1 tensors
10月 09 13:15:56 gpu ollama[42508]: llm_load_vocab: special tokens cache size = 223
10月 09 13:15:56 gpu ollama[42508]: llm_load_vocab: token to piece cache size = 0.9732 MB
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: format = GGUF V3 (latest)
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: arch = chatglm
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: vocab type = BPE
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_vocab = 151552
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_merges = 151073
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: vocab_only = 0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_ctx_train = 131072
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_embd = 4096
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_layer = 40
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_head = 32
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_head_kv = 2
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_rot = 64
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_swa = 0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_embd_head_k = 128
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_embd_head_v = 128
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_gqa = 16
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_embd_k_gqa = 256
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_embd_v_gqa = 256
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: f_norm_eps = 0.0e+00
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: f_norm_rms_eps = 1.6e-07
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: f_logit_scale = 0.0e+00
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_ff = 13696
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_expert = 0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_expert_used = 0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: causal attn = 1
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: pooling type = 0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: rope type = 0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: rope scaling = linear
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: freq_base_train = 5000000.0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: freq_scale_train = 1
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: n_ctx_orig_yarn = 131072
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: rope_finetuned = unknown
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: ssm_d_conv = 0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: ssm_d_inner = 0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: ssm_d_state = 0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: ssm_dt_rank = 0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: ssm_dt_b_c_rms = 0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: model type = 9B
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: model ftype = Q4_0
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: model params = 9.40 B
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: model size = 5.08 GiB (4.64 BPW)
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: general.name = glm-4-9b-chat
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: EOS token = 151329 '<|endoftext|>'
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: UNK token = 151329 '<|endoftext|>'
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: PAD token = 151329 '<|endoftext|>'
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: LF token = 128 'Ä'
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: EOT token = 151336 '<|user|>'
10月 09 13:15:56 gpu ollama[42508]: llm_load_print_meta: max token length = 1024
10月 09 13:15:56 gpu ollama[42508]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
10月 09 13:15:56 gpu ollama[42508]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
10月 09 13:15:56 gpu ollama[42508]: ggml_cuda_init: found 1 CUDA devices:
10月 09 13:15:56 gpu ollama[42508]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
10月 09 13:15:57 gpu ollama[42508]: llm_load_tensors: ggml ctx size = 0.28 MiB
10月 09 13:15:57 gpu ollama[42508]: time=2024-10-09T13:15:57.634+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
10月 09 13:15:57 gpu ollama[42508]: time=2024-10-09T13:15:57.949+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
10月 09 13:15:57 gpu ollama[42508]: llm_load_tensors: offloading 40 repeating layers to GPU
10月 09 13:15:57 gpu ollama[42508]: llm_load_tensors: offloading non-repeating layers to GPU
10月 09 13:15:57 gpu ollama[42508]: llm_load_tensors: offloaded 41/41 layers to GPU
10月 09 13:15:57 gpu ollama[42508]: llm_load_tensors: CPU buffer size = 333.00 MiB
10月 09 13:15:57 gpu ollama[42508]: llm_load_tensors: CUDA0 buffer size = 4863.85 MiB
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: n_ctx = 128032
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: n_batch = 512
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: n_ubatch = 512
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: flash_attn = 0
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: freq_base = 5000000.0
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: freq_scale = 1
10月 09 13:15:58 gpu ollama[42508]: llama_kv_cache_init: CUDA0 KV buffer size = 5001.25 MiB
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: KV self size = 5001.25 MiB, K (f16): 2500.62 MiB, V (f16): 2500.62 MiB
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: CUDA_Host output buffer size = 0.59 MiB
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: CUDA0 compute buffer size = 8285.07 MiB
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: CUDA_Host compute buffer size = 258.07 MiB
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: graph nodes = 1606
10月 09 13:15:58 gpu ollama[42508]: llama_new_context_with_model: graph splits = 2
10月 09 13:15:59 gpu ollama[42508]: INFO [main] model loaded | tid="140519079464960" timestamp=1728450959
10月 09 13:15:59 gpu ollama[42508]: time=2024-10-09T13:15:59.459+08:00 level=INFO source=server.go:626 msg="llama runner started in 3.54 seconds"
10月 09 13:18:09 gpu ollama[42508]: [GIN] 2024/10/09 - 13:18:09 | 200 | 4m13s | 172.22.1.39 | POST "/api/chat"
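The same arithmetic covers the 128K-context glm4:9b case: with only 2 KV heads (n_embd_k_gqa = 256), its cache stays moderate even at this length, though the log shows the CUDA0 compute buffer growing to 8285 MiB alongside it. A worked check in Python, again assuming f16 K/V:

n_layer = 40          # chatglm.block_count
n_embd_k_gqa = 256    # equals n_embd_v_gqa for this model
n_ctx = 128032        # --ctx-size 128001, rounded up by the runner
f16_bytes = 2

kv_mib = 2 * n_layer * n_ctx * n_embd_k_gqa * f16_bytes / 2**20  # K plus V
print(kv_mib)  # 5001.25, matching "KV self size = 5001.25 MiB"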
@goactiongo commented on GitHub (Oct 9, 2024):
Here is a successful run with short text content using glm4:9b (--ctx-size 128001).
AI DEBUG logs
ollama debug logs
10月 09 13:24:02 gpu ollama[44027]: time=2024-10-09T13:24:02.807+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 gpu=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 parallel=1 available=21459828736 required="18.4 GiB"
10月 09 13:24:02 gpu ollama[44027]: time=2024-10-09T13:24:02.807+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="112.7 GiB" free_swap="3.7 GiB"
10月 09 13:24:02 gpu ollama[44027]: time=2024-10-09T13:24:02.808+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[20.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.4 GiB" memory.required.partial="18.4 GiB" memory.required.kv="4.9 GiB" memory.required.allocations="[18.4 GiB]" memory.weights.total="9.2 GiB" memory.weights.repeating="8.7 GiB" memory.weights.nonrepeating="485.6 MiB" memory.graph.full="8.1 GiB" memory.graph.partial="8.2 GiB"
10月 09 13:24:02 gpu ollama[44027]: time=2024-10-09T13:24:02.820+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama3286374914/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 --ctx-size 128001 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 1 --port 39005"
10月 09 13:24:02 gpu ollama[44027]: time=2024-10-09T13:24:02.821+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
10月 09 13:24:02 gpu ollama[44027]: time=2024-10-09T13:24:02.821+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
10月 09 13:24:02 gpu ollama[44027]: time=2024-10-09T13:24:02.822+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
10月 09 13:24:02 gpu ollama[44027]: INFO [main] build info | build=10 commit="9225b05" tid="139725003341824" timestamp=1728451442
10月 09 13:24:02 gpu ollama[44027]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139725003341824" timestamp=1728451442 total_threads=64
10月 09 13:24:02 gpu ollama[44027]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="39005" tid="139725003341824" timestamp=1728451442
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: loaded meta data with 24 key-value pairs and 283 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 (version GGUF V3 (latest))
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 0: general.architecture str = chatglm
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 1: general.name str = glm-4-9b-chat
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 2: chatglm.context_length u32 = 131072
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 3: chatglm.embedding_length u32 = 4096
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 4: chatglm.feed_forward_length u32 = 13696
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 5: chatglm.block_count u32 = 40
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 6: chatglm.attention.head_count u32 = 32
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 7: chatglm.attention.head_count_kv u32 = 2
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 8: chatglm.attention.layer_norm_rms_epsilon f32 = 0.000000
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 9: general.file_type u32 = 2
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 10: chatglm.rope.dimension_count u32 = 64
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 12: chatglm.rope.freq_base f32 = 5000000.000000
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
10月 09 13:24:02 gpu ollama[44027]: llama_model_loader: - kv 14: tokenizer.ggml.pre str = chatglm-bpe
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,151552] = ["!", """, "#", "$", "%", "&", "'", ...
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
10月 09 13:24:03 gpu ollama[44027]: time=2024-10-09T13:24:03.076+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,151073] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 151329
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 151329
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - kv 20: tokenizer.ggml.eot_token_id u32 = 151336
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 151329
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - kv 22: tokenizer.chat_template str = [gMASK]{% for item in messages %...
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - kv 23: general.quantization_version u32 = 2
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - type f32: 121 tensors
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - type q4_0: 161 tensors
10月 09 13:24:03 gpu ollama[44027]: llama_model_loader: - type q6_K: 1 tensors
10月 09 13:24:03 gpu ollama[44027]: llm_load_vocab: special tokens cache size = 223
10月 09 13:24:03 gpu ollama[44027]: llm_load_vocab: token to piece cache size = 0.9732 MB
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: format = GGUF V3 (latest)
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: arch = chatglm
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: vocab type = BPE
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_vocab = 151552
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_merges = 151073
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: vocab_only = 0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_ctx_train = 131072
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_embd = 4096
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_layer = 40
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_head = 32
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_head_kv = 2
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_rot = 64
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_swa = 0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_embd_head_k = 128
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_embd_head_v = 128
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_gqa = 16
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_embd_k_gqa = 256
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_embd_v_gqa = 256
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: f_norm_eps = 0.0e+00
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: f_norm_rms_eps = 1.6e-07
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: f_logit_scale = 0.0e+00
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_ff = 13696
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_expert = 0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_expert_used = 0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: causal attn = 1
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: pooling type = 0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: rope type = 0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: rope scaling = linear
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: freq_base_train = 5000000.0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: freq_scale_train = 1
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: n_ctx_orig_yarn = 131072
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: rope_finetuned = unknown
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: ssm_d_conv = 0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: ssm_d_inner = 0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: ssm_d_state = 0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: ssm_dt_rank = 0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: ssm_dt_b_c_rms = 0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: model type = 9B
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: model ftype = Q4_0
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: model params = 9.40 B
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: model size = 5.08 GiB (4.64 BPW)
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: general.name = glm-4-9b-chat
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: EOS token = 151329 '<|endoftext|>'
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: UNK token = 151329 '<|endoftext|>'
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: PAD token = 151329 '<|endoftext|>'
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: LF token = 128 'Ä'
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: EOT token = 151336 '<|user|>'
10月 09 13:24:03 gpu ollama[44027]: llm_load_print_meta: max token length = 1024
10月 09 13:24:03 gpu ollama[44027]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
10月 09 13:24:03 gpu ollama[44027]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
10月 09 13:24:03 gpu ollama[44027]: ggml_cuda_init: found 1 CUDA devices:
10月 09 13:24:03 gpu ollama[44027]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
10月 09 13:24:04 gpu ollama[44027]: llm_load_tensors: ggml ctx size = 0.28 MiB
10月 09 13:24:04 gpu ollama[44027]: time=2024-10-09T13:24:04.537+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
10月 09 13:24:04 gpu ollama[44027]: time=2024-10-09T13:24:04.877+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
10月 09 13:24:04 gpu ollama[44027]: llm_load_tensors: offloading 40 repeating layers to GPU
10月 09 13:24:04 gpu ollama[44027]: llm_load_tensors: offloading non-repeating layers to GPU
10月 09 13:24:04 gpu ollama[44027]: llm_load_tensors: offloaded 41/41 layers to GPU
10月 09 13:24:04 gpu ollama[44027]: llm_load_tensors: CPU buffer size = 333.00 MiB
10月 09 13:24:04 gpu ollama[44027]: llm_load_tensors: CUDA0 buffer size = 4863.85 MiB
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: n_ctx = 128032
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: n_batch = 512
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: n_ubatch = 512
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: flash_attn = 0
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: freq_base = 5000000.0
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: freq_scale = 1
10月 09 13:24:05 gpu ollama[44027]: llama_kv_cache_init: CUDA0 KV buffer size = 5001.25 MiB
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: KV self size = 5001.25 MiB, K (f16): 2500.62 MiB, V (f16): 2500.62 MiB
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: CUDA_Host output buffer size = 0.59 MiB
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: CUDA0 compute buffer size = 8285.07 MiB
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: CUDA_Host compute buffer size = 258.07 MiB
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: graph nodes = 1606
10月 09 13:24:05 gpu ollama[44027]: llama_new_context_with_model: graph splits = 2
10月 09 13:24:06 gpu ollama[44027]: INFO [main] model loaded | tid="139725003341824" timestamp=1728451446
10月 09 13:24:06 gpu ollama[44027]: time=2024-10-09T13:24:06.140+08:00 level=INFO source=server.go:626 msg="llama runner started in 3.32 seconds"
10月 09 13:24:07 gpu ollama[44027]: [GIN] 2024/10/09 - 13:24:07 | 200 | 5.883286351s | 172.22.1.39 | POST "/api/chat"
@rick-github commented on GitHub (Oct 9, 2024):
Can you upload the PDF and give a pointer to the client that you are using?
@goactiongo commented on GitHub (Oct 11, 2024):
ollama version is 0.3.11
I used FastGPT.
I'm certain this issue is not related to the PDF format. If I use a Word file with long text content, the same problem occurs. As long as the document content is lengthy, the error appears.
But as long as the content in the document is minimal, it is recognized normally, whether it's a PDF or a Word file.
Here is the same issue when I used a DOCX file (test1.docx) with 5000+ characters.
AI debug logs
ollama debug logs
10月 11 13:51:37 gpu ollama[60354]: time=2024-10-11T13:51:37.930+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
10月 11 13:51:37 gpu ollama[60354]: time=2024-10-11T13:51:37.931+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
10月 11 13:51:39 gpu ollama[60354]: time=2024-10-11T13:51:39.194+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="14.7 GiB"
10月 11 13:51:39 gpu ollama[60354]: time=2024-10-11T13:51:39.194+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB"
10月 11 13:51:39 gpu ollama[60354]: time=2024-10-11T13:51:39.194+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB"
10月 11 13:51:39 gpu ollama[60354]: time=2024-10-11T13:51:39.194+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB"
10月 11 13:51:40 gpu ollama[60354]: time=2024-10-11T13:51:40.354+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 gpu=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 parallel=4 available=24986779648 required="21.5 GiB"
10月 11 13:51:40 gpu ollama[60354]: time=2024-10-11T13:51:40.354+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="114.3 GiB" free_swap="3.7 GiB"
10月 11 13:51:40 gpu ollama[60354]: time=2024-10-11T13:51:40.356+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.5 GiB" memory.required.partial="21.5 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="19.5 GiB" memory.weights.repeating="18.9 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
10月 11 13:51:40 gpu ollama[60354]: time=2024-10-11T13:51:40.374+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama3399730686/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 65 --parallel 4 --port 45366"
10月 11 13:51:40 gpu ollama[60354]: time=2024-10-11T13:51:40.375+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
10月 11 13:51:40 gpu ollama[60354]: time=2024-10-11T13:51:40.375+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
10月 11 13:51:40 gpu ollama[60354]: time=2024-10-11T13:51:40.376+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
10月 11 13:51:40 gpu ollama[60354]: INFO [main] build info | build=10 commit="9225b05" tid="140516332331008" timestamp=1728625900
10月 11 13:51:40 gpu ollama[60354]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140516332331008" timestamp=1728625900 total_threads=64
10月 11 13:51:40 gpu ollama[60354]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="45366" tid="140516332331008" timestamp=1728625900
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: loaded meta data with 34 key-value pairs and 771 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 (version GGUF V3 (latest))
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 0: general.architecture str = qwen2
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 1: general.type str = model
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 2: general.name str = Qwen2.5 32B Instruct
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 3: general.finetune str = Instruct
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 4: general.basename str = Qwen2.5
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 5: general.size_label str = 32B
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 6: general.license str = apache-2.0
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-3...
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 8: general.base_model.count u32 = 1
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 32B
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-32B
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 12: general.tags arr[str,2] = ["chat", "text-generation"]
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 14: qwen2.block_count u32 = 64
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 15: qwen2.context_length u32 = 32768
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 16: qwen2.embedding_length u32 = 5120
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 27648
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 40
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 8
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 22: general.file_type u32 = 15
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
10月 11 13:51:40 gpu ollama[60354]: time=2024-10-11T13:51:40.628+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - kv 33: general.quantization_version u32 = 2
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - type f32: 321 tensors
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - type q4_K: 385 tensors
10月 11 13:51:40 gpu ollama[60354]: llama_model_loader: - type q6_K: 65 tensors
10月 11 13:51:40 gpu ollama[60354]: llm_load_vocab: special tokens cache size = 22
10月 11 13:51:41 gpu ollama[60354]: llm_load_vocab: token to piece cache size = 0.9310 MB
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: format = GGUF V3 (latest)
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: arch = qwen2
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: vocab type = BPE
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_vocab = 152064
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_merges = 151387
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: vocab_only = 0
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_ctx_train = 32768
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_embd = 5120
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_layer = 64
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_head = 40
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_head_kv = 8
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_rot = 128
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_swa = 0
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_embd_head_k = 128
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_embd_head_v = 128
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_gqa = 5
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_embd_k_gqa = 1024
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_embd_v_gqa = 1024
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: f_norm_eps = 0.0e+00
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: f_logit_scale = 0.0e+00
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_ff = 27648
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_expert = 0
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_expert_used = 0
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: causal attn = 1
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: pooling type = 0
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: rope type = 2
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: rope scaling = linear
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: freq_base_train = 1000000.0
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: freq_scale_train = 1
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: n_ctx_orig_yarn = 32768
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: rope_finetuned = unknown
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: ssm_d_conv = 0
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: ssm_d_inner = 0
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: ssm_d_state = 0
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: ssm_dt_rank = 0
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: ssm_dt_b_c_rms = 0
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: model type = ?B
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: model ftype = Q4_K - Medium
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: model params = 32.76 B
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: model size = 18.48 GiB (4.85 BPW)
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: general.name = Qwen2.5 32B Instruct
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: EOS token = 151645 '<|im_end|>'
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: LF token = 148848 'ÄĬ'
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: EOT token = 151645 '<|im_end|>'
10月 11 13:51:41 gpu ollama[60354]: llm_load_print_meta: max token length = 256
10月 11 13:51:41 gpu ollama[60354]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
10月 11 13:51:41 gpu ollama[60354]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
10月 11 13:51:41 gpu ollama[60354]: ggml_cuda_init: found 1 CUDA devices:
10月 11 13:51:41 gpu ollama[60354]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
10月 11 13:51:41 gpu ollama[60354]: llm_load_tensors: ggml ctx size = 0.68 MiB
10月 11 13:51:42 gpu ollama[60354]: time=2024-10-11T13:51:42.085+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
10月 11 13:51:43 gpu ollama[60354]: time=2024-10-11T13:51:43.741+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
10月 11 13:51:43 gpu ollama[60354]: llm_load_tensors: offloading 64 repeating layers to GPU
10月 11 13:51:43 gpu ollama[60354]: llm_load_tensors: offloading non-repeating layers to GPU
10月 11 13:51:43 gpu ollama[60354]: llm_load_tensors: offloaded 65/65 layers to GPU
10月 11 13:51:43 gpu ollama[60354]: llm_load_tensors: CPU buffer size = 417.66 MiB
10月 11 13:51:43 gpu ollama[60354]: llm_load_tensors: CUDA0 buffer size = 18508.35 MiB
10月 11 13:51:46 gpu ollama[60354]: time=2024-10-11T13:51:46.453+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: n_ctx = 8192
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: n_batch = 512
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: n_ubatch = 512
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: flash_attn = 0
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: freq_base = 1000000.0
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: freq_scale = 1
10月 11 13:51:46 gpu ollama[60354]: llama_kv_cache_init: CUDA0 KV buffer size = 2048.00 MiB
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: CUDA_Host output buffer size = 2.40 MiB
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: CUDA0 compute buffer size = 696.00 MiB
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: CUDA_Host compute buffer size = 26.01 MiB
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: graph nodes = 2246
10月 11 13:51:46 gpu ollama[60354]: llama_new_context_with_model: graph splits = 2
10月 11 13:51:46 gpu ollama[60354]: time=2024-10-11T13:51:46.705+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
10月 11 13:51:46 gpu ollama[60354]: INFO [main] model loaded | tid="140516332331008" timestamp=1728625906
10月 11 13:51:46 gpu ollama[60354]: time=2024-10-11T13:51:46.957+08:00 level=INFO source=server.go:626 msg="llama runner started in 6.58 seconds"
10月 11 13:51:50 gpu ollama[60354]: [GIN] 2024/10/11 - 13:51:50 | 200 | 11.286623482s | 172.16.1.219 | POST "/api/chat"
@goactiongo commented on GitHub (Oct 11, 2024):
@rick-github Hi, any feedback please?
@rick-github commented on GitHub (Oct 11, 2024):
I asked about the client so that I could replicate the problem as closely as possible, but that looks too time-consuming, so I instead just tested ollama directly.
Baseline: test with a simple text file:
Output:
Un-JSON'ified:
Test with 600007.pdf as per your description of the problem:
Output:
Un-JSON'ified:
I translated the document with Google translate and while I didn't read enough to verify the facts, the summary looks relevant.
Your problem description notes that FastGPT sent 20,000 characters to ollama. I assume that's a configuration setting in FastGPT. I retried the test with 100,000 characters of the PDF file and it also worked:
Testing with qwen2.5:32b and a context window of 30001 returned the following:
Un-JSON'ified:
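For reference, tests of this shape can be driven directly against the API. A minimal sketch, assuming a local server, `jq` installed, and a hypothetical `doc.txt` holding the extracted document text:

```sh
# build a summarization request around the whole document and send it to ollama
jq -n --rawfile doc doc.txt '{
  model: "qwen2.5:32b",
  prompt: ("Summarize the following document:\n\n" + $doc),
  stream: false,
  options: {num_ctx: 30001}
}' | curl -s http://localhost:11434/api/generate --data-binary @-
```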
So ollama appears to be working as expected. If you add `OLLAMA_DEBUG=1` to your server environment and try again, the resulting logs may indicate where the problem is. My guess is that FastGPT is dumping the raw document into the prompt, but since I didn't get it working, I can't say for sure.
@goactiongo commented on GitHub (Oct 11, 2024):
Thank you very much for your reply.
Could you please take a look at the ollama log information I sent out before?
It's strange that short file contents are recognized but long ones are not.
As the AI log shows, FastGPT first parses the attachment to extract the content, places it in the SYSTEM prompt, and then sends the SYSTEM information and USER commands to the local ollama model for processing.
Strangely, if I use an online public-cloud LLM in place of the local ollama model, the content is recognized normally regardless of the file size.
Could you please review the ollama logs I sent before to see if there are any issues?
Let me know if you need further assistance!
THANKS AGAIN
@rick-github commented on GitHub (Oct 11, 2024):
Your logs don't contain enough information; you removed the actual content that FastGPT is sending to ollama. If you add `OLLAMA_DEBUG=1` to your server environment and try again, the resulting logs may indicate where the problem is.
@goactiongo commented on GitHub (Oct 12, 2024):
Thanks. Please review the following information (gathered after adding `OLLAMA_DEBUG=1` to my server environment).
FastGPT log (to make it easier for you to understand, I have translated some of the Chinese into English)
ollama log
Additional Note
If I place the document content in the HUMAN prompt (previously it was placed in the SYSTEM prompt by default), ollama can recognize and summarize the document. However, with this approach the LLM does not follow instructions well and often gives irrelevant responses.
Is there a character limit on the SYSTEM prompt for ollama models?
Due to certain functional limitations, I have to continue placing it in the SYSTEM prompt. Do you have any related suggestions?
@rick-github commented on GitHub (Oct 12, 2024):
You need to set `OLLAMA_DEBUG=1` in the server environment: edit the ollama service, add the environment line, save and exit, restart the service, then run your test and collect the logs, as in the sketch below.
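A minimal sketch of the usual systemd workflow (assuming a systemd-managed install named `ollama`; the override contents follow ollama's documented pattern):

```sh
sudo systemctl edit ollama.service
# in the override file that opens, add:
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
sudo systemctl daemon-reload
sudo systemctl restart ollama
# run the failing request, then capture the logs:
journalctl -u ollama --no-pager > ollama.log
```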
Attach `ollama.log` to this issue.

There shouldn't be any character limits other than the overall context window size. The `system` prompt is usually used for just setting the style or purpose of a model, and context is usually added to the `user` role. But that really depends on the model: some may be fine with context in the `system` role, others have been trained to expect it in the `user` role. In my experiments so far, both qwen2.5 and glm4 work fine with context in the `system` role. We just need to figure out why it doesn't seem to work in your case.
@goactiongo commented on GitHub (Oct 13, 2024):
Thanks for your reply. The ollama log files are attached. Feel free to let me know if more information is needed.
ollama.log
You can start by reviewing the following content:
10月 13 10:50:10 gpu systemd[1]: Started Ollama Service.
OR
ollama-1.log
You can start by reviewing the following content: 10月 13 11:20:51
OR
ollama-2.log
You can start from: 10月 13 15:43:53 g
AND
ollama-3.log
This log file is from putting the PDF contents into the USER prompt instead of the SYSTEM prompt; it succeeded in summarizing the content.
Please start from "10月 13 16:11:02 gpu systemd[1]: Started Ollama Service."
AND
ollama-4.log
This log file is from a test with a short file named test.docx; it succeeded in summarizing the content.
Please start from 10月 13 16:22:40 gpu systemd[1]: Started Ollama Service.
MORE INFORMATION
AI debug log
@rick-github commented on GitHub (Oct 13, 2024):
Your context size is too small for the size of the document you are attempting to summarize. For most of the tests, you used qwen2.5:32b with a context of 30001:
When you submit a test, ollama computes the size of the payload and when it exceeds the context window, it drops messages:
This results in a prompt with no context:
On the last test you switched to glm4:9b with a context of 128001:
Ollama was able to process this test:
For the test in ollama-3.log where you put the text in the USER prompt, the middle of the text has this:
so not all of the text is being passed in, allowing the remaining text to fit in the context window.
60007.pdf has 91017 characters according to `wc -m`, so it should fit in the 128001-token window of glm4:9b. Try that and add the logs if it doesn't work.
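A quick way to sanity-check the fit (assuming poppler's `pdftotext` is installed; token counts depend on the tokenizer, and for Chinese text can approach one token per character):

```sh
# extract the PDF text and count characters, then compare against the
# model's context window (glm4:9b trains at 131072 tokens)
pdftotext 60007.pdf - | wc -m
```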
@goactiongo commented on GitHub (Oct 14, 2024):
Thanks for your information. I now mostly understand the root cause: the number of tokens the LLM can support.
The following test used glm4:9b, whose context length is 131072.
Whether I set --ctx-size 128001 or --ctx-size 96001, the ollama log showed the following kinds of messages:
ollama-5.log
It seems the available memory is sufficient, so why does the log show "gpu has too little memory to allocate any layers" and "minimum_memory=479199232"?
Also, the logs show that ollama received the PDF content successfully, but FastGPT received a "network error".
Here is the test using llama3.1:8b:
ollama-6.log
@rick-github commented on GitHub (Oct 15, 2024):
You have `OLLAMA_NUM_PARALLEL` unset. In ollama-5.log, you set a context size of 96001. Since you have lots of VRAM, ollama first tries to load the model with `OLLAMA_NUM_PARALLEL=4`. That is, it needs to find a GPU where it can load the model weights (7.5G), the memory graph (6.1G), and 4 context windows (3.7G * 4), which with some optimization comes to 24.6G. It tries each of your GPUs with this configuration, and each one fails because none of them have enough free VRAM. ollama then retries with `OLLAMA_NUM_PARALLEL=1`, reducing the VRAM required to 15.1G. The first GPU has 23.3G free, so ollama loads the model on this GPU.

Same story in ollama-6.log, but here you have a smaller context of 64001 and a different model, so a different amount of VRAM is required, 17G. It fails to fit on the first three GPUs (available VRAM 16.8G, 9.7G, 7.0G) but fits on the last one (23.3G available).
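Roughly, using the numbers from ollama-5.log (ollama's actual accounting includes extra buffers and optimizations, so these sums are approximate):

```sh
# VRAM needed ≈ weights + memory graph + parallel × per-slot context
# parallel=4: 7.5G + 6.1G + 4 × 3.7G ≈ 28.4G  (~24.6G after optimization, per the log)
# parallel=1: 7.5G + 6.1G + 1 × 3.7G ≈ 17.3G  (~15.1G reported)
```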
@goactiongo commented on GitHub (Oct 16, 2024):
Thank you for your response; I understand your explanation. Both models eventually ran successfully, but why didn't either model output the results as instructed?
Could you please take another look at the log files to see if there are any other errors that might be preventing the output results?
@goactiongo commented on GitHub (Oct 16, 2024):
I have found the root cause of the problem. The reason for the unsuccessful output was that the connect_time and other parameters were not set in nginx. The issue is now resolved. Thank you very much for your help, I really appreciate it.
@goactiongo commented on GitHub (Oct 16, 2024):
Thank you for your guidance, my problem has been resolved.
One more question:
Does OLLAMA_NUM_PARALLEL=4 mean that a model can only run on one GPU card? If a single GPU card does not have enough resources to run 4 parallel requests, can the 4 parallel requests be distributed across multiple GPU cards?
@rick-github commented on GitHub (Oct 16, 2024):
You can set `OLLAMA_SCHED_SPREAD=1` to have ollama divide the model across all cards. Theoretically this will allow each of the GPUs to work on a different completion at the same time. In practice, the scheduling will vary depending on the input tokens, the probabilistic token generation, and whatever other models are running on the GPU. My suspicion is that you would see a performance improvement, but not 4x. If you wanted to maximize performance, it might be better to run multiple servers as described here. You could move all of the other models onto one or two servers and then have the remaining servers dedicated to the summarization tasks (depending on context size, model, etc).
@goactiongo commented on GitHub (Oct 19, 2024):
Thank you for your reply. I have conducted the following two tests, and I am unsure how to handle some issues, so I need your further assistance.
1. Overall Situation Description
1.1 In the same process, three models are sequentially called through the API interface to handle the content summarization task. The models include llama3.1:8b, glm4:9b, llama3.2:latest, and each model has the following parameters set:
1.2. If the environment variable Environment="OLLAMA_SCHED_SPREAD=1" is not set, all three models will run successfully in sequence, but the inference time is relatively long.
Log file ollama2.log
ollama2.log
1.3. After setting Environment="OLLAMA_SCHED_SPREAD=1".
In order to improve the inference efficiency of the three models and to fully utilize four GPU cards for concurrent processing, I set the environment variable Environment="OLLAMA_SCHED_SPREAD=1" as you instructed. However, after multiple tests, the first model runs successfully every time, the second model fails almost every time, and the third model sometimes succeeds and sometimes fails.
Log file ollama1.log
ollama1.log
1.4 In ollama1.log, the first model succeeds, and the second and third both fail.
Model 2 API Error Log as followed
Model 3 API Error Log as followed
1.5 Analysis of ollama1.log
According to your previous instructions, I interpreted the log information in ollama1.log and conducted an analysis. There are some areas where I do not understand or may have misunderstood, and I hope for your assistance.
Note: The following analysis and most of the logs are from ollama1.log. Only in sections 4.5.1 and 5.5 did I compare the relevant information from ollama2.log
2. There are four GPU cards; the available resources are 23.3 GiB, 23.3 GiB, 16.8 GiB, and 9.7 GiB, as follows:
3. First Model: llama3.1:8b
3.1 By default, OLLAMA_NUM_PARALLEL=4, the required resources are partial_offload="32.3 GiB" full_offload="32.3 GiB", but none of the GPU cards can meet this requirement. As follows,
3.2 From the following log, it can be observed that ollama automatically sets parallel=1, and the required resources are required="55.3 GiB". (I do not understand why the requirement is now 55.3G, which is greater than the 32.3G shown in 3.1. Also, does parallel=1 in the log mean the four concurrent slots were reduced to one? And if so, why does it require more resources?) As follows:
3.3 The model runs on four GPU cards. As follows.
Question: If Environment="OLLAMA_SCHED_SPREAD=1" is not set, why does this model still run on four GPU cards (which does not match my expectation), while the other two models run on only one GPU card (which does)? For this part, you can refer to the log file ollama2.log.
3.4 The first model works normally, as follows:
4. Second Model: glm4:9b
4.1 After the first model ends, perhaps because the GPU resources have not been completely released at this time (I am not sure if this is the reason), the available resources of each GPU cannot meet the model's needs of partial_offload="30.9 GiB" full_offload="30.9 GiB". As follows,
4.2 Next, I noticed "resetting model to expire immediately to make room".
4.3 Then, I noticed the llama server stopped and regained GPU resources (consistent with the initial available resources).
4.4 However, the available resources of each GPU card still cannot meet the needs of the second model.
Question: Is OLLAMA_NUM_PARALLEL still set to 4 at this time?
4.5 Then the model uses four GPU cards to run, and the available resources memory.available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" can meet the memory.required.full="43.1 GiB". As follows,
Question: At this time, OLLAMA_NUM_PARALLEL=4, why doesn't it automatically set OLLAMA_NUM_PARALLEL=1 like the first model? And assess whether any one of the GPU cards can meet the resources required by OLLAMA_NUM_PARALLEL=1?
4.5.1 The following information is from ollama2.log; it can be observed that when `OLLAMA_SCHED_SPREAD=1` is not set, this model eventually runs on one GPU card and ultimately succeeds (different from 4.5 and 4.6 above).
4.6 Next, I saw the OOM error message in the logs.
I don't understand why these errors occurred ("allocating 8574.52 MiB on device 3: cudaMalloc failed: out of memory"). From the above logs, it seems the available resources on device 3 should be sufficient.
4.7 Upon checking the logs through the frontend application, the API interface returned the following error message (this should be due to the errors mentioned above):
5. Third Model: llama3.2:latest
5.1 At this point, the resources of the four GPU cards have been completely released. By default, OLLAMA_NUM_PARALLEL=4, and the required resources are partial_offload="24.9 GiB" full_offload="24.9 GiB", but none of the individual GPU cards can meet the required resources. As follows,
5.2 The following log shows parallel=1, and the model runs on four GPU cards.
Question 1: Why does parallel=1 require more resources, 43.7G, memory.required.full="43.7 GiB", while only 24.9G was needed in 5.1?
Question 2: Why do models 1 and 3 automatically downgrade from parallel=4 to parallel=1, but the second model does not automatically adjust to parallel=1?
5.3 According to the above analysis, the available resources of the four GPU cards (23.3 GiB, 23.3 GiB, 16.8 GiB, 9.7 GiB) can meet the model's requirement of 43.7G, so why does the third model suddenly report an OOM error?
and
5.4 Upon checking the logs through the frontend application, the API interface returned the following error message:
5.5 The following log is from ollama2.log, and it can be observed that the model eventually runs on a single GPU card and ultimately succeeds. (Different from 5.2 and 5.3 above)
6. I monitor the GPUs by running `gpustat -i 1`, as follows:
6.1 If the environment variable Environment="OLLAMA_SCHED_SPREAD=1" is set (as shown in ollama1.log, model 1 runs successfully while models 2 and 3 fail):
All three models run on GPUs 1, 2, and 3 (which is basically as expected), but I do not know why GPU 0 has not been used or is only occasionally occupied.
6.2 If the environment variable Environment="OLLAMA_SCHED_SPREAD=1" is not set (as shown in ollama2.log, all three models run successfully):
First model: It runs on GPUs 1, 2, and 3. I do not know why GPU 1 has not been used or is only occasionally occupied, and I am unclear why this model can still run on multiple GPU cards without setting the environment variable Environment="OLLAMA_SCHED_SPREAD=1".
Second model: It runs only on GPU 2 (as expected).
Third model: It runs only on GPU 3 (as expected).
@rick-github commented on GitHub (Oct 20, 2024):
Set `OLLAMA_NUM_PARALLEL=1` in your server environment; it reduces confusion.

llama3.1:8b always uses 4 GPUs because it cannot fit on one GPU. `OLLAMA_SCHED_SPREAD=1` only has an effect when there is an actual decision to be made about whether to run a model on one GPU or all GPUs.

A model consumes more VRAM when it runs on multiple GPUs because each slice of the model gets a copy of the memory graph. So total VRAM is roughly weights + contextBuffer + (memoryGraph * numGPU). More GPUs, more space needed for copies of the memory graph.
Note that your attempt to get concurrent processing is inefficient. This is because you cannot fit two or more models into VRAM at the same time, due to your large context window and the size of the overhead required to spread the model across all GPUs. While inference might be faster for an individual completion, you are adding latency because you are constantly loading and unloading models.
Regarding the OOMs, it's possible that the memory reporting is not as accurate as it could be. One way to test would be to control the model loading. If you add `keep_alive: 0` to the API call, the model will be unloaded immediately after the generation is completed (see the sketch below). Your client could then wait a number of seconds before sending the next request. In this way you can adjust the time delay between sending requests and check how many seconds you need to wait for all three API calls to succeed. If this does have an effect, then it means that the memory handling in ollama needs to address this.
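A minimal sketch of such a call (the model, prompt, and delay are illustrative):

```sh
# unload the model as soon as this completion finishes
curl -s http://localhost:11434/api/chat -d '{
  "model": "glm4:9b",
  "messages": [{"role": "user", "content": "Summarize the attached text ..."}],
  "stream": false,
  "keep_alive": 0
}'
sleep 10   # tune this delay until all three calls reliably succeed
```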
Other alternatives:

- Set `OLLAMA_FLASH_ATTENTION=1` in the server environment. Flash attention is a more efficient use of KV space and reduces memory pressure.
- Set `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` in the server environment. This allows the CUDA driver to use system RAM if VRAM is fully allocated. Memory allocated in this way will result in slower token generation, but if it's a small amount, the impact should be minimal.
- Use `CUDA_VISIBLE_DEVICES` to assign each ollama server a GPU as described here (see the sketch after this list). For the llama3.1 model, it may run on 2 GPUs because it will have half the number of copies of the memory graph, but that will need to be determined. So you have 3 ollama servers: one with 2 dedicated GPUs running llama3.1, and the other two ollama servers each with one dedicated GPU running llama3.2 and glm4 respectively. Your client just needs to send each call to the appropriate server. This will give you actual concurrent processing, unlike what you are getting now.
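A minimal sketch of that layout (ports and GPU assignments are illustrative):

```sh
# one ollama server per model, each pinned to its own GPU(s)
CUDA_VISIBLE_DEVICES=0,1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &   # llama3.1:8b
CUDA_VISIBLE_DEVICES=2   OLLAMA_HOST=127.0.0.1:11436 ollama serve &   # llama3.2:latest
CUDA_VISIBLE_DEVICES=3   OLLAMA_HOST=127.0.0.1:11437 ollama serve &   # glm4:9b

# point each client call at the server hosting its model, e.g.:
curl -s http://127.0.0.1:11437/api/chat -d '{
  "model": "glm4:9b",
  "messages": [{"role": "user", "content": "hello"}],
  "stream": false
}'
```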