Mirror of https://github.com/ollama/ollama.git
Closed · 9 comments
Labels: bug
Originally created by @goactiongo on GitHub (Sep 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6902
What is the issue?
Scenario One
Calling a public cloud-based LLM through an AI agent, I upload two documents of more than 2,000 words each with the question: "Analyze the differences between the two documents." Used this way, the model analyzes the differences between the two documents normally.
Scenario Two
Calling a locally deployed LLM through Ollama 0.3.3 instead (multiple different models have been tried), with the same documents and the same question, the model says it cannot find the documents to compare.
If the document content is reduced to around 1,000 words, the model compares them normally.
Adjusting the model's maxContext and maxResponse from small to large has no effect.
Scenario Three
Uploading a document of more than 2,000 words and asking the local Ollama model to summarize it triggers the same issue. If the document is reduced to around 1,000 words, the local Ollama model analyzes it normally.
Despite trying multiple Ollama models and adjusting maxContext and maxResponse from 2,000 to 30,000, the problem persists.
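For reference, Ollama itself only honors a context window passed as the num_ctx option in a native API request, or baked into the model via a Modelfile PARAMETER; an agent-side maxContext setting that is not forwarded as num_ctx leaves the server at its default window. A minimal sketch of a native request that sets it explicitly, assuming the default endpoint and a gemma2:27b tag (both placeholders here):

# Hypothetical request against Ollama's native chat API; the num_ctx
# option sets the per-request context window (prompt text elided).
curl http://localhost:11434/api/chat -d '{
  "model": "gemma2:27b",
  "messages": [
    {"role": "user", "content": "Analyze the differences between the two documents. ..."}
  ],
  "options": { "num_ctx": 8192 },
  "stream": false
}'

Note that the logs below show the agent calling the OpenAI-compatible /v1/chat/completions endpoint, which at this point in Ollama's history had no per-request equivalent of num_ctx, so the window would have to come from the model itself (see the Modelfile sketch after the version info below).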
The log messages are as follows:
(base) [root@gpu ~]# journalctl -u ollama -r
-- Logs begin at 一 2024-09-02 03:24:01 CST, end at 六 2024-09-21 21:59:04 CST. --
9月 21 21:59:04 gpu ollama[48923]: [GIN] 2024/09/21 - 21:59:04 | 200 | 11.452294657s | 172.16.1.219 | POST "/v1/chat/completions"
9月 21 21:59:01 gpu ollama[48923]: time=2024-09-21T21:59:01.646+08:00 level=INFO source=server.go:623 msg="llama runner started in 6.78 seconds"
9月 21 21:59:01 gpu ollama[48923]: INFO [main] model loaded | tid="140514816995328" timestamp=1726927141
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: graph splits = 2
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: graph nodes = 1850
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
9月 21 21:59:00 gpu ollama[48923]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: freq_scale = 1
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: freq_base = 10000.0
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: flash_attn = 0
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: n_ubatch = 512
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: n_batch = 512
9月 21 21:59:00 gpu ollama[48923]: llama_new_context_with_model: n_ctx = 8192
9月 21 21:58:58 gpu ollama[48923]: time=2024-09-21T21:58:58.185+08:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm serve
9月 21 21:58:58 gpu ollama[48923]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB
9月 21 21:58:58 gpu ollama[48923]: llm_load_tensors: CPU buffer size = 922.85 MiB
9月 21 21:58:58 gpu ollama[48923]: llm_load_tensors: offloaded 47/47 layers to GPU
9月 21 21:58:58 gpu ollama[48923]: llm_load_tensors: offloading non-repeating layers to GPU
9月 21 21:58:58 gpu ollama[48923]: llm_load_tensors: offloading 46 repeating layers to GPU
9月 21 21:58:56 gpu ollama[48923]: time=2024-09-21T21:58:56.580+08:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm serve
9月 21 21:58:56 gpu ollama[48923]: llm_load_tensors: ggml ctx size = 0.45 MiB
9月 21 21:58:55 gpu ollama[48923]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
9月 21 21:58:55 gpu ollama[48923]: ggml_cuda_init: found 1 CUDA devices:
9月 21 21:58:55 gpu ollama[48923]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
9月 21 21:58:55 gpu ollama[48923]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: max token length = 93
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: EOT token = 107 '<end_of_turn>'
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: LF token = 227 '<0x0A>'
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: PAD token = 0 ''
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: UNK token = 3 ''
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: EOS token = 1 ''
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: BOS token = 2 ''
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: general.name = gemma-2-27b-it
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW)
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: model params = 27.23 B
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: model ftype = Q4_0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: model type = 27B
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: ssm_dt_rank = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: ssm_d_state = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: ssm_d_inner = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: ssm_d_conv = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: rope_finetuned = unknown
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_ctx_orig_yarn = 8192
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: freq_scale_train = 1
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: freq_base_train = 10000.0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: rope scaling = linear
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: rope type = 2
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: pooling type = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: causal attn = 1
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_expert_used = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_expert = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_ff = 36864
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: f_logit_scale = 0.0e+00
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: f_norm_eps = 0.0e+00
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_embd_v_gqa = 2048
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_embd_k_gqa = 2048
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_gqa = 2
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_embd_head_v = 128
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_embd_head_k = 128
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_swa = 4096
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_rot = 128
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_head_kv = 16
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_head = 32
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_layer = 46
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_embd = 4608
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_ctx_train = 8192
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: vocab_only = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_merges = 0
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: n_vocab = 256000
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: vocab type = SPM
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: arch = gemma2
9月 21 21:58:55 gpu ollama[48923]: llm_load_print_meta: format = GGUF V3 (latest)
9月 21 21:58:55 gpu ollama[48923]: llm_load_vocab: token to piece cache size = 1.6014 MB
9月 21 21:58:55 gpu ollama[48923]: llm_load_vocab: special tokens cache size = 108
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - type q6_K: 1 tensors
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - type q4_0: 322 tensors
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - type f32: 185 tensors
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000
9月 21 21:58:55 gpu ollama[48923]: time=2024-09-21T21:58:55.122+08:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm serve
9月 21 21:58:55 gpu ollama[48923]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "",
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 11: general.file_type u32 = 2
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: - kv 0: general.architecture str = gemma2
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
9月 21 21:58:54 gpu ollama[48923]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d
9月 21 21:58:54 gpu ollama[48923]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="34781" tid="140514816995328" timestamp=1726927
9月 21 21:58:54 gpu ollama[48923]: INFO [main] system info | n_threads=32 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VB
9月 21 21:58:54 gpu ollama[48923]: INFO [main] build info | build=1 commit="6eeaeba" tid="140514816995328" timestamp=1726927134
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.869+08:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm serve
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.864+08:00 level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.863+08:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.863+08:00 level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama242898797/runners/
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.862+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=47 laye
9月 21 21:58:54 gpu ollama[48923]: time=2024-09-21T21:58:54.861+08:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading
After upgrading to 0.3.11, the same issue occurs, with the log as follows:
9月 21 22:39:08 gpu ollama[54696]: [GIN] 2024/09/21 - 22:39:08 | 200 | 10.260377093s | 172.16.1.219 | POST "/v1/chat/completions"
9月 21 22:39:05 gpu ollama[54696]: time=2024-09-21T22:39:05.203+08:00 level=INFO source=server.go:626 msg="llama runner started in 5.67 seconds"
9月 21 22:39:05 gpu ollama[54696]: INFO [main] model loaded | tid="140149569400832" timestamp=1726929545
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: graph splits = 2
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: graph nodes = 1850
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
9月 21 22:39:04 gpu ollama[54696]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: freq_scale = 1
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: freq_base = 10000.0
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: flash_attn = 0
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: n_ubatch = 512
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: n_batch = 512
9月 21 22:39:04 gpu ollama[54696]: llama_new_context_with_model: n_ctx = 8192
9月 21 22:39:02 gpu ollama[54696]: time=2024-09-21T22:39:02.405+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve
9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB
9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: CPU buffer size = 922.85 MiB
9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: offloaded 47/47 layers to GPU
9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: offloading non-repeating layers to GPU
9月 21 22:39:02 gpu ollama[54696]: llm_load_tensors: offloading 46 repeating layers to GPU
9月 21 22:39:01 gpu ollama[54696]: time=2024-09-21T22:39:01.250+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve
9月 21 22:39:00 gpu ollama[54696]: llm_load_tensors: ggml ctx size = 0.45 MiB
9月 21 22:39:00 gpu ollama[54696]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
9月 21 22:39:00 gpu ollama[54696]: ggml_cuda_init: found 1 CUDA devices:
9月 21 22:39:00 gpu ollama[54696]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
9月 21 22:39:00 gpu ollama[54696]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: max token length = 93
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: EOT token = 107 '<end_of_turn>'
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: LF token = 227 '<0x0A>'
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: PAD token = 0 ''
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: UNK token = 3 ''
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: EOS token = 1 ''
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: BOS token = 2 ''
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: general.name = gemma-2-27b-it
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW)
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model params = 27.23 B
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model ftype = Q4_0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: model type = 27B
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_dt_b_c_rms = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_dt_rank = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_d_state = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_d_inner = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: ssm_d_conv = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: rope_finetuned = unknown
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_ctx_orig_yarn = 8192
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: freq_scale_train = 1
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: freq_base_train = 10000.0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: rope scaling = linear
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: rope type = 2
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: pooling type = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: causal attn = 1
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_expert_used = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_expert = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_ff = 36864
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_logit_scale = 0.0e+00
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: f_norm_eps = 0.0e+00
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_v_gqa = 2048
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_k_gqa = 2048
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_gqa = 2
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_head_v = 128
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd_head_k = 128
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_swa = 4096
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_rot = 128
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_head_kv = 16
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_head = 32
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_layer = 46
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_embd = 4608
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_ctx_train = 8192
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: vocab_only = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_merges = 0
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: n_vocab = 256000
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: vocab type = SPM
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: arch = gemma2
9月 21 22:39:00 gpu ollama[54696]: llm_load_print_meta: format = GGUF V3 (latest)
9月 21 22:39:00 gpu ollama[54696]: llm_load_vocab: token to piece cache size = 1.6014 MB
9月 21 22:39:00 gpu ollama[54696]: llm_load_vocab: special tokens cache size = 108
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - type q6_K: 1 tensors
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - type q4_0: 322 tensors
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - type f32: 185 tensors
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "",
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.792+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 11: general.file_type u32 = 2
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: - kv 0: general.architecture str = gemma2
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
9月 21 22:38:59 gpu ollama[54696]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d
9月 21 22:38:59 gpu ollama[54696]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="43383" tid="140149569400832" timestamp=1726929
9月 21 22:38:59 gpu ollama[54696]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VB
9月 21 22:38:59 gpu ollama[54696]: INFO [main] build info | build=10 commit="9225b05" tid="140149569400832" timestamp=1726929539
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.537+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm serve
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.536+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.536+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.534+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2642747161/runners
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.516+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=47 laye
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.514+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="111.5 GiB" free_sw
9月 21 22:38:59 gpu ollama[54696]: time=2024-09-21T22:38:59.514+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loadin
9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.422+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e
9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.422+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2
9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.422+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd00879
9月 21 22:38:51 gpu ollama[54696]: time=2024-09-21T22:38:51.421+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a0
9月 21 22:38:50 gpu ollama[54696]: time=2024-09-21T22:38:50.097+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
9月 21 22:38:50 gpu ollama[54696]: time=2024-09-21T22:38:50.097+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda
9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.553+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2642747161/runn
9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.552+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.551+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
9月 21 22:38:34 gpu ollama[54696]: time=2024-09-21T22:38:34.549+08:00 level=INFO source=images.go:753 msg="total blobs: 44"
9月 21 22:38:34 gpu ollama[54696]: 2024/09/21 22:38:34 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HS
OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: 0.3.3
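Reading the logs: every load shows n_ctx = 8192, and the scheduler lines below show parallel=4 with --ctx-size 8192. If that window is split across the four parallel slots, as in Ollama 0.3.x, each request gets roughly 2,048 tokens, which would match ~1,000-word documents fitting while ~2,000-word documents are silently truncated. A minimal sketch of the usual workarounds, assuming the model was pulled as gemma2:27b and Ollama runs under the stock systemd unit:

# Bake a full 8K window into a model variant (gemma-2 was trained at 8192):
ollama show gemma2:27b --modelfile > Modelfile
echo 'PARAMETER num_ctx 8192' >> Modelfile
ollama create gemma2-8k -f Modelfile

# Or stop splitting the window across slots: add
#   Environment="OLLAMA_NUM_PARALLEL=1"
# to the ollama systemd unit, then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama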
@rick-github commented on GitHub (Sep 21, 2024):
Your log is missing useful information. Run this:
journalctl -u ollama --no-pager
How are you adjusting maxContext and maxResponse?
@goactiongo commented on GitHub (Sep 22, 2024):
Here, I just used gemma-2-27b as a sample; I have tried many models and the issue is the same.
— The 27B model was trained with 13 trillion tokens and the 9B model was trained with 8 trillion tokens.
Or, after changing to the following command, the same issue:
journalctl -u ollama --no-pager
9月 22 13:35:58 gpu systemd[1]: Stopping Ollama Service...
9月 22 13:36:01 gpu systemd[1]: Stopped Ollama Service.
9月 22 13:36:01 gpu systemd[1]: Started Ollama Service.
9月 22 13:36:01 gpu ollama[50713]: 2024/09/22 13:36:01 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.944+08:00 level=INFO source=images.go:753 msg="total blobs: 34"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.946+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.947+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
9月 22 13:36:01 gpu ollama[50713]: time=2024-09-22T13:36:01.950+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2397062001/runners
9月 22 13:36:17 gpu ollama[50713]: time=2024-09-22T13:36:17.632+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
9月 22 13:36:17 gpu ollama[50713]: time=2024-09-22T13:36:17.632+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="2.8 GiB"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="19.7 GiB"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="14.8 GiB"
9月 22 13:36:19 gpu ollama[50713]: time=2024-09-22T13:36:19.003+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="6.4 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.572+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc gpu=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 parallel=4 available=21122187264 required="18.8 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.572+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="100.7 GiB" free_swap="3.5 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.573+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=47 layers.offload=47 layers.split="" memory.available="[19.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.8 GiB" memory.required.partial="18.8 GiB" memory.required.kv="2.9 GiB" memory.required.allocations="[18.8 GiB]" memory.weights.total="16.5 GiB" memory.weights.repeating="15.6 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="562.0 MiB" memory.graph.partial="1.4 GiB"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2397062001/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 4 --port 38200"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.592+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.593+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
9月 22 13:36:33 gpu ollama[50713]: INFO [main] build info | build=10 commit="9225b05" tid="140300405710848" timestamp=1726983393
9月 22 13:36:33 gpu ollama[50713]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140300405710848" timestamp=1726983393 total_threads=64
9月 22 13:36:33 gpu ollama[50713]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="38200" tid="140300405710848" timestamp=1726983393
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc (version GGUF V3 (latest))
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 0: general.architecture str = gemma2
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 11: general.file_type u32 = 2
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
9月 22 13:36:33 gpu ollama[50713]: time=2024-09-22T13:36:33.846+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ...
9月 22 13:36:33 gpu ollama[50713]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type f32: 185 tensors
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type q4_0: 322 tensors
9月 22 13:36:34 gpu ollama[50713]: llama_model_loader: - type q6_K: 1 tensors
9月 22 13:36:34 gpu ollama[50713]: llm_load_vocab: special tokens cache size = 108
9月 22 13:36:34 gpu ollama[50713]: llm_load_vocab: token to piece cache size = 1.6014 MB
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: format = GGUF V3 (latest)
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: arch = gemma2
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: vocab type = SPM
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_vocab = 256000
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_merges = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: vocab_only = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ctx_train = 8192
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd = 4608
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_layer = 46
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_head = 32
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_head_kv = 16
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_rot = 128
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_swa = 4096
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_head_k = 128
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_head_v = 128
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_gqa = 2
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_k_gqa = 2048
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_embd_v_gqa = 2048
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_norm_eps = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: f_logit_scale = 0.0e+00
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ff = 36864
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_expert = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_expert_used = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: causal attn = 1
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: pooling type = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope type = 2
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope scaling = linear
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: freq_base_train = 10000.0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: freq_scale_train = 1
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: n_ctx_orig_yarn = 8192
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: rope_finetuned = unknown
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_conv = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_inner = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_d_state = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_dt_rank = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: ssm_dt_b_c_rms = 0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model type = 27B
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model ftype = Q4_0
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model params = 27.23 B
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW)
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: general.name = gemma-2-27b-it
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: BOS token = 2 ''
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: EOS token = 1 ''
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: UNK token = 3 ''
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: PAD token = 0 ''
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: LF token = 227 '<0x0A>'
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: EOT token = 107 '<end_of_turn>'
9月 22 13:36:34 gpu ollama[50713]: llm_load_print_meta: max token length = 93
9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
9月 22 13:36:34 gpu ollama[50713]: ggml_cuda_init: found 1 CUDA devices:
9月 22 13:36:34 gpu ollama[50713]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
9月 22 13:36:34 gpu ollama[50713]: llm_load_tensors: ggml ctx size = 0.45 MiB
9月 22 13:36:35 gpu ollama[50713]: time=2024-09-22T13:36:35.304+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
9月 22 13:36:36 gpu ollama[50713]: time=2024-09-22T13:36:36.592+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloading 46 repeating layers to GPU
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloading non-repeating layers to GPU
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: offloaded 47/47 layers to GPU
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: CPU buffer size = 922.85 MiB
9月 22 13:36:36 gpu ollama[50713]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB
9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.309+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_ctx = 8192
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_batch = 512
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: n_ubatch = 512
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: flash_attn = 0
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: freq_base = 10000.0
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: freq_scale = 1
9月 22 13:36:39 gpu ollama[50713]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: graph nodes = 1850
9月 22 13:36:39 gpu ollama[50713]: llama_new_context_with_model: graph splits = 2
9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.562+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:36:39 gpu ollama[50713]: INFO [main] model loaded | tid="140300405710848" timestamp=1726983399
9月 22 13:36:39 gpu ollama[50713]: time=2024-09-22T13:36:39.814+08:00 level=INFO source=server.go:626 msg="llama runner started in 6.22 seconds"
9月 22 13:36:42 gpu ollama[50713]: [GIN] 2024/09/22 - 13:36:42 | 200 | 10.06400965s | 172.16.1.219 | POST "/v1/chat/completions"
9月 22 13:36:46 gpu systemd[1]: Stopping Ollama Service...
9月 22 13:36:47 gpu systemd[1]: Stopped Ollama Service.
9月 22 13:36:47 gpu systemd[1]: Started Ollama Service.
9月 22 13:36:47 gpu ollama[50857]: 2024/09/22 13:36:47 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost: https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.610+08:00 level=INFO source=images.go:753 msg="total blobs: 34"
9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.613+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.614+08:00 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.11)"
9月 22 13:36:47 gpu ollama[50857]: time=2024-09-22T13:36:47.616+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2548666145/runners
9月 22 13:37:02 gpu ollama[50857]: time=2024-09-22T13:37:02.902+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v12 rocm_v60102 cpu cpu_avx cpu_avx2 cuda_v11]"
9月 22 13:37:02 gpu ollama[50857]: time=2024-09-22T13:37:02.902+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="2.8 GiB"
9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="19.7 GiB"
9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="14.8 GiB"
9月 22 13:37:04 gpu ollama[50857]: time=2024-09-22T13:37:04.255+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="6.4 GiB"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.159+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc gpu=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 parallel=4 available=21122187264 required="18.8 GiB"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.159+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="100.7 GiB" free_swap="3.5 GiB"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.160+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=47 layers.offload=47 layers.split="" memory.available="[19.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="18.8 GiB" memory.required.partial="18.8 GiB" memory.required.kv="2.9 GiB" memory.required.allocations="[18.8 GiB]" memory.weights.total="16.5 GiB" memory.weights.repeating="15.6 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="562.0 MiB" memory.graph.partial="1.4 GiB"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.176+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2548666145/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 47 --parallel 4 --port 42032"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.177+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.178+08:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.178+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
9月 22 13:37:53 gpu ollama[50857]: INFO [main] build info | build=10 commit="9225b05" tid="140713765588992" timestamp=1726983473
9月 22 13:37:53 gpu ollama[50857]: INFO [main] system info | n_threads=32 n_threads_batch=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140713765588992" timestamp=1726983473 total_threads=64
9月 22 13:37:53 gpu ollama[50857]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="42032" tid="140713765588992" timestamp=1726983473
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: loaded meta data with 29 key-value pairs and 508 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-d7e4b00a7d7a8d03d4eed9b0f3f61a427e9f0fc5dea6aeb414e41dee23dc8ecc (version GGUF V3 (latest))
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 0: general.architecture str = gemma2
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 1: general.name str = gemma-2-27b-it
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 4: gemma2.block_count u32 = 46
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 11: general.file_type u32 = 2
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
9月 22 13:37:53 gpu ollama[50857]: time=2024-09-22T13:37:53.431+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ...
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - type f32: 185 tensors
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - type q4_0: 322 tensors
9月 22 13:37:53 gpu ollama[50857]: llama_model_loader: - type q6_K: 1 tensors
9月 22 13:37:53 gpu ollama[50857]: llm_load_vocab: special tokens cache size = 108
9月 22 13:37:53 gpu ollama[50857]: llm_load_vocab: token to piece cache size = 1.6014 MB
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: format = GGUF V3 (latest)
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: arch = gemma2
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: vocab type = SPM
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_vocab = 256000
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_merges = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: vocab_only = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_ctx_train = 8192
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd = 4608
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_layer = 46
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_head = 32
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_head_kv = 16
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_rot = 128
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_swa = 4096
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_head_k = 128
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_head_v = 128
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_gqa = 2
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_k_gqa = 2048
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_embd_v_gqa = 2048
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_norm_eps = 0.0e+00
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: f_logit_scale = 0.0e+00
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_ff = 36864
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_expert = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_expert_used = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: causal attn = 1
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: pooling type = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: rope type = 2
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: rope scaling = linear
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: freq_base_train = 10000.0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: freq_scale_train = 1
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: n_ctx_orig_yarn = 8192
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: rope_finetuned = unknown
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_d_conv = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_d_inner = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_d_state = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_dt_rank = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: ssm_dt_b_c_rms = 0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: model type = 27B
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: model ftype = Q4_0
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: model params = 27.23 B
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: model size = 14.55 GiB (4.59 BPW)
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: general.name = gemma-2-27b-it
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: BOS token = 2 ''
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: EOS token = 1 ''
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: UNK token = 3 ''
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: PAD token = 0 ''
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: LF token = 227 '<0x0A>'
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: EOT token = 107 '<end_of_turn>'
9月 22 13:37:53 gpu ollama[50857]: llm_load_print_meta: max token length = 93
9月 22 13:37:53 gpu ollama[50857]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
9月 22 13:37:53 gpu ollama[50857]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
9月 22 13:37:53 gpu ollama[50857]: ggml_cuda_init: found 1 CUDA devices:
9月 22 13:37:53 gpu ollama[50857]: Device 0: NVIDIA A30, compute capability 8.0, VMM: yes
9月 22 13:37:54 gpu ollama[50857]: llm_load_tensors: ggml ctx size = 0.45 MiB
9月 22 13:37:54 gpu ollama[50857]: time=2024-09-22T13:37:54.888+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: offloading 46 repeating layers to GPU
9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: offloading non-repeating layers to GPU
9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: offloaded 47/47 layers to GPU
9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: CPU buffer size = 922.85 MiB
9月 22 13:37:55 gpu ollama[50857]: llm_load_tensors: CUDA0 buffer size = 14898.60 MiB
9月 22 13:37:56 gpu ollama[50857]: time=2024-09-22T13:37:56.042+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: n_ctx = 8192
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: n_batch = 512
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: n_ubatch = 512
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: flash_attn = 0
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: freq_base = 10000.0
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: freq_scale = 1
9月 22 13:37:58 gpu ollama[50857]: llama_kv_cache_init: CUDA0 KV buffer size = 2944.00 MiB
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: KV self size = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: CUDA0 compute buffer size = 578.00 MiB
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: CUDA_Host compute buffer size = 41.01 MiB
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: graph nodes = 1850
9月 22 13:37:58 gpu ollama[50857]: llama_new_context_with_model: graph splits = 2
9月 22 13:37:58 gpu ollama[50857]: INFO [main] model loaded | tid="140713765588992" timestamp=1726983478
9月 22 13:37:58 gpu ollama[50857]: time=2024-09-22T13:37:58.556+08:00 level=INFO source=server.go:626 msg="llama runner started in 5.38 seconds"
9月 22 13:38:00 gpu ollama[50857]: [GIN] 2024/09/22 - 13:38:00 | 200 | 9.160086456s | 172.16.1.219 | POST "/v1/chat/completions"
(base) [root@gpu ~]#
Here is the AI DEBUG information:
System
answer the question.
The content within "" is to be considered as your knowledge
File: test.docx
[Extracted text of test.docx follows — several hundred lines from the 2024 semi-annual report of 中国国际贸易中心股份有限公司 (China World Trade Center Co., Ltd.), flattened by the docx extraction. The paste covers the notes to the consolidated financial statements: the income tax expense reconciliation, basic and diluted earnings per share, notes to the cash flow statement (other cash received and paid relating to operating and financing activities), supplementary cash flow information (reconciliation of net profit to operating cash flow, net change in cash and cash equivalents, changes in liabilities arising from financing activities, cash and cash equivalents), foreign-currency monetary items, interests in subsidiaries and associates, segment information for the leasing/property-management and hotel businesses, and related-party relationships and transactions.]
Human
summarize the document
AI
Please provide me with the document you would like me to summarize. I need the actual text of the document in order to analyze it and create a summary for you.
For example, you can paste the text directly into our chat or provide a link to the document if it's publicly accessible online.
@rick-github commented on GitHub (Sep 22, 2024):
You are running 4 serving threads (`--parallel 4`) and the total context size is 8k (`--ctx-size 8192`), so each request is using the default context window of 2048 tokens. Whatever you are doing with `maxContext` and `maxResponse` is not relevant to ollama. The corresponding ollama configuration elements are `num_ctx` and `num_predict`; those are the parameters you need to adjust to get the documents to fit in the context window.
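For example, a request that allocates an 8k context window for the query and caps the response length might look like this (a minimal sketch; the model name, prompt, and values are placeholders, not from the original report):

```
curl http://localhost:11434/api/generate -d '{
  "model": "gemma2:27b",
  "prompt": "summarize the document: ...",
  "stream": false,
  "options": {
    "num_ctx": 8192,
    "num_predict": 1024
  }
}'
```

@goactiongo commented on GitHub (Sep 22, 2024):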
Thanks for your reply.
I want to know why `num_ctx` is not the same as my setting, and why `num_predict` is not shown.
1st test: "num_ctx": 120000, "num_predict": 7000
ollama log shows `--ctx-size 120000`, without num_predict
2nd test: "num_ctx": 5000, "num_predict": 3000
ollama log shows `--ctx-size 20000`, without num_predict
```
9月 22 17:59:20 gpu ollama[57349]: time=2024-09-22T17:59:20.566+08:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama2118317042/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 --ctx-size 20000 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 4 --port 44778"
```
@rick-github commented on GitHub (Sep 22, 2024):
The screenshot that shows a context length of 131072 is the context length that the model was trained with. This is different from `num_ctx`, which is the size of the context window that ollama allocates for processing queries. The context window consumes VRAM, and if it is very large, it can cause the model weights to overflow to system RAM, making inference much slower. For this reason, the default context window that ollama allocates is smaller than the context window that the model was trained with. By default, it is 2048 tokens.

1st test: the llama runner is running one thread (`--parallel 1`), so the total space for context is 120000 (`--ctx-size 120000`). `num_predict` is not shown as a parameter on the command line because it is not a per-model parameter, it is a per-query parameter. `num_predict` is passed to the llama runner as part of the request that includes the prompt.

2nd test: the llama runner is running four threads (`--parallel 4`), each with a context window of 5000, so the total space for context is 20000 (`--ctx-size 20000`).

The reason the thread count changes is that in the 1st test, ollama saw that you were asking for a very large context window, and as explained earlier, a large context can cause model weights to spill to RAM, so ollama decided to use 1 thread. In the 2nd test, you asked for a smaller context, and ollama saw that it could fit 4 threads' worth of context in the available VRAM, so it set `--parallel 4`.

You can override this behavior, where ollama chooses the thread count, by setting `OLLAMA_NUM_PARALLEL` in the server environment. If you set `OLLAMA_NUM_PARALLEL=1` in the second test, the context size will be 5000 (`--ctx-size 5000`).
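On a systemd install, one way to set this (a sketch, assuming the default `ollama.service` unit created by the install script) is a drop-in override:

```
# create a drop-in override for the ollama unit
sudo systemctl edit ollama.service

# in the editor, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"

# apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

@goactiongo commented on GitHub (Sep 22, 2024):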
thanks
@goactiongo commented on GitHub (Sep 23, 2024):
Is there something wrong with my code? Neither `"num_parallel": 2` nor `"ollama_num_parallel": 2` works:
```
{
  "model": "glm4:9b",
  "prompt": "{{qst}}:{{text}}",
  "stream": false,
  "options": {
    "num_ctx": 5000,
    "num_predict": 3000,
    "num_parallel": 2
  }
}
```

```
9月 23 13:19:38 gpu ollama[38429]: time=2024-09-23T13:19:38.388+08:00 level=WARN source=types.go:509 msg="invalid option provided" option=num_parallel
9月 23 13:22:59 gpu ollama[38429]: time=2024-09-23T13:22:59.115+08:00 level=WARN source=types.go:509 msg="invalid option provided" option=ollama_num_parallel
```
@rick-github commented on GitHub (Sep 23, 2024):
`num_parallel` is not a valid option in an API call. You need to set `OLLAMA_NUM_PARALLEL=2` in the server environment.
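A corrected version of the request (same placeholder prompt as above) drops `num_parallel` from the options; the parallelism is set server-side instead, e.g. via the systemd drop-in shown earlier with `Environment="OLLAMA_NUM_PARALLEL=2"`:

```
curl http://localhost:11434/api/generate -d '{
  "model": "glm4:9b",
  "prompt": "{{qst}}:{{text}}",
  "stream": false,
  "options": {
    "num_ctx": 5000,
    "num_predict": 3000
  }
}'
```

@goactiongo commented on GitHub (Sep 23, 2024):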
thanks for your help