Mirror of https://github.com/ollama/ollama.git (synced 2026-05-06 08:02:14 -05:00)
Open · opened 2026-04-22 13:30:19 -05:00 by GiteaMirror · 30 comments
Originally created by @fanlessfan on GitHub (Mar 28, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10030
Hello,
I tried running deepseek-r1 CPU-only on a dual Xeon 6138 system with 768GB of memory. The result: 671b (1.74 t/s) is faster than 70b (1.53 t/s), even though the 671b model takes longer. I also tried the r1-1776 671b-q8 (713GB) model, which runs at 1.29 t/s, not that much slower.
Could anyone explain this?
thx
@rick-github commented on GitHub (Mar 28, 2025):
Multi CPU/NUMA systems can suffer from thread contention (#2936, #8074, #10022). Try reducing the thread count and see if performance changes.
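For reference, the thread count can be changed without reinstalling anything, either with the interactive `/set parameter` command or via the `num_thread` option on the REST API (the value 20 below is just an illustrative starting point, not a recommendation for this hardware):

```shell
# In an interactive session:
#   ollama run deepseek-r1:671b
#   >>> /set parameter num_thread 20
# Or per request over the API:
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:671b",
  "prompt": "why is the sky blue?",
  "options": { "num_thread": 20 }
}'
```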
@fanlessfan commented on GitHub (Mar 28, 2025):
I think something else is affecting the result. The 70b model is the distilled llama 70b, and running plain llama 70b gives me the same speed as the deepseek distilled 70b; but the 671b model is deepseek itself, and it's faster. The 713GB deepseek model should be around half the speed of the 404GB deepseek model, but it's only 25% slower (1.29/1.74 = 75%).
Can anyone explain it?
@fighter3005 commented on GitHub (Mar 28, 2025):
I don't know about Q4 vs Q8, but 70B should be slower than 671B, since the 671B only has about 38B active parameters. 671B obviously takes a lot longer to load. Maybe it took longer because it used more tokens for thinking?
When I test llama3.2-vision-11B Q4 vs Q8 I also see roughly a 25% uplift. Don't know if that's normal though. (RTX 3090)
@fanlessfan commented on GitHub (Mar 28, 2025):
Is there any benefit to 70b over 671b, other than loading faster and using less memory?
@fighter3005 commented on GitHub (Mar 29, 2025):
Again, not an expert, but since the 671B model is huge, it doesn't fit on many systems. In that case 70B would be beneficial, especially if you don't have a few hundred GB of VRAM lying around. But since you can fit 671B in memory, I guess there's no reason to use the 70B model, other than running multiple instances or maybe a longer context, etc.
@NGC13009 commented on GitHub (Mar 29, 2025):
The 70b is a distilled llama, so the hardware has to do the full computation for all 70b parameters.
The 671b is deepseek's MoE architecture: at inference time only about 37b parameters are activated per token, so even though the model is huge, it actually runs at the speed of a ~37b model. You should compare it against a dense model of around 37b, such as qwen:32b, and the numbers will roughly line up.
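The MoE point above can be checked with a back-of-envelope estimate: CPU decode is memory-bandwidth-bound, so tokens/second are roughly bandwidth divided by the bytes read per token, and for an MoE model only the active parameters are read. The constants here are rough assumptions, not from the thread: ~0.5 bytes/weight for Q4 quantization and ~100 GB/s of sustained bandwidth, ignoring KV-cache reads and NUMA penalties, so these are optimistic upper bounds.

```python
# Upper-bound decode speed for bandwidth-bound CPU inference.
# Assumptions: Q4 quantization ~0.5 bytes/weight, ~100 GB/s sustained
# bandwidth; KV-cache traffic and NUMA penalties are ignored.
def upper_bound_tps(active_params, bytes_per_param=0.5, bandwidth_gbs=100):
    bytes_per_token = active_params * bytes_per_param  # weights read per token
    return bandwidth_gbs * 1e9 / bytes_per_token

moe_671b = upper_bound_tps(37e9)   # ~5.4 t/s: only ~37B weights touched per token
dense_70b = upper_bound_tps(70e9)  # ~2.9 t/s: all 70B weights touched per token
```

The absolute numbers are optimistic, but the ratio matches what the thread observes: the 671b MoE decodes faster than the dense 70b because it touches roughly half as many weights per token.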
@navr32 commented on GitHub (Mar 29, 2025):
Do you have AVX512 enabled?
@fanlessfan commented on GitHub (Mar 29, 2025):
@navr32 I just installed standard ollama on an Ubuntu server. Is there a command I can use to check for AVX512?
@rick-github commented on GitHub (Mar 29, 2025):
@fanlessfan commented on GitHub (Mar 29, 2025):
@navr32 does the output below mean AVX512 is enabled? thx
lscpu | grep avx512
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke md_clear flush_l1d arch_capabilities
@fanlessfan commented on GitHub (Mar 29, 2025):
@NGC13009 How do you know that about 37b parameters are actually activated per token at inference time? qwen runs at 2.83 t/s; r1-1776:671b-q4_K_M runs at 1.8 t/s.
@mrdg-sys commented on GitHub (Apr 3, 2025):
Hi,
I have some insight regarding your dual-CPU system and a few tricks to speed up CPU-only inference.
My system is very similar to yours: dual 6138 Xeons with AVX512 and 384GB of RAM, no GPU. My RAM configuration is tuned for maximum memory throughput: one DIMM per memory channel, 12 DIMMs total across 2 CPUs × 6 memory channels per CPU.
With this configuration I am able to get an average of 3 tokens/second with CPU-only inference. This is the best result we can achieve without NUMA support in Ollama; if such support is introduced later, perhaps we can reach 5 t/s with our dual-CPU systems.
I did try another memory configuration, filling every available DIMM slot on my motherboard and thereby doubling total RAM, but this resulted in slower inference (down to 2 t/s) because the RAM speed drops from 2666MHz to 2133MHz when all slots are filled.
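The throughput numbers above line up with the theoretical channel math; a quick sketch (the 8-bytes-per-transfer and 12-channel figures are standard DDR4 parameters for this CPU, assumed here rather than taken from the thread):

```python
# Theoretical peak DDR4 bandwidth for a dual Xeon 6138 setup:
# each channel moves 8 bytes per transfer, 6 channels/CPU x 2 CPUs = 12 channels.
def total_bw_gbs(mt_per_s, channels=12):
    return mt_per_s * 1e6 * 8 * channels / 1e9

bw_one_dimm_per_channel = total_bw_gbs(2666)  # ~256 GB/s at 2666 MT/s
bw_all_slots_filled = total_bw_gbs(2133)      # ~205 GB/s after the downclock
```

The ~256 GB/s theoretical peak at 2666 MT/s is consistent with the ~217 GB/s "ALL Reads" figure measured later in the thread, and it shows why filling every slot (forcing 2133 MT/s) costs about 20% of peak bandwidth, which hurts a bandwidth-bound workload even though total RAM doubles.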
@fanlessfan commented on GitHub (Apr 4, 2025):
Hi @mrdg-sys,
I think we have exactly the same config, except I have 64GB × 12 = 768GB of RAM. I have an X11DPH-T motherboard. How did you get 3 tokens/s?
Here is my memory speed:
Intel(R) Memory Latency Checker - v3.11b
*** Unable to modify prefetchers (try executing 'modprobe msr')
*** So, enabling random access for latency measurements
Measuring idle latencies for random access (in ns)...
Numa node
Numa node 0 1
0 88.5 142.1
1 142.2 84.1
Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 216944.3
3:1 Reads-Writes : 204116.6
2:1 Reads-Writes : 203703.3
1:1 Reads-Writes : 190054.8
Stream-triad like: 178900.1
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Numa node
Numa node 0 1
0 108828.8 51099.7
1 51098.2 108531.9
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
00000 200.80 217166.8
00002 201.15 217055.5
00008 201.34 217105.4
00015 200.94 216880.5
00050 199.42 214537.3
00100 142.04 192391.5
00200 110.08 120579.6
00300 104.14 83600.2
00400 97.67 64017.2
00500 95.84 51905.7
00700 93.87 37740.0
01000 92.05 26926.5
01300 91.15 20951.5
01700 90.39 16274.2
02500 89.70 11345.9
03500 89.33 8325.9
05000 88.99 6057.5
09000 88.94 3694.7
20000 87.68 2068.5
thx
@mrdg-sys commented on GitHub (Apr 4, 2025):
I think I can already spot your issue; it has to do with your memory bandwidth result:
ALL Reads : 216944.3
To achieve this result, your motherboard's BIOS must have NUMA enabled by default, which maximizes memory bandwidth by splitting the workload between DIMMs and CPU cores... but this configuration actually slows down CPU LLM inference. To improve, you need to disable all NUMA settings in your motherboard's BIOS.
My server is a Fujitsu RX2530 M4 and my BIOS lets me disable NUMA. In your case you may need to change NUMA from 2-way to 1-way or 0-way if a disable option isn't available.
After reconfiguring your BIOS, run another memory test and look for a roughly halved bandwidth result, something like ALL Reads ~110000.
With that result you can be sure NUMA is now disabled.
Even though your memory throughput is (in theory) now halved, actual LLM inference will speed up because each CPU core has unrestricted access to all available memory instead of being clustered to its local node.
Also, I leave hyperthreading enabled so I can do other tasks on the server while running LLM inference.
@mrdg-sys commented on GitHub (Apr 4, 2025):
By the way, have you tried the latest qwq 32B Q4 (20GB) model?
I've experimented with many LLMs, large and small, and so far they all give inconsistent results when asked a control question such as:
How many R letters are in the word strawberry?
The answer varies between 2 and 3, which makes them unreliable. It's important to ask control questions after a series of LLM answers to see if it's still on track... or has lost its mind.
The latest qwq 32B model has so far always been consistent: always 3, the correct answer.
The reason I bring this up is that this model is small compared to the others and can easily fit in GPU memory for fast inference. I think there's no point dealing with the large 70B and 671B models, which are very slow and give inconsistent results, when you can work with a quick tiny one and get correct answers.
Give it a try.
@NGC13009 commented on GitHub (Apr 4, 2025):
deepseek's paper provides this number: each expert is about 4B parameters and 8 are activated at a time, giving 32B, plus some computation outside the expert layers, for a total of roughly 37b.
That said, actual inference isn't necessarily exactly equivalent to a ~32b model's speed: it depends on which GPUs you use and how the deployment is optimized (e.g. multi-GPU parallelism). In general, anything above 20 t/s on GPU is normal, and reaching 3 t/s on CPU is already very good. These are numbers from my own experience, for reference only.
@fanlessfan commented on GitHub (Apr 4, 2025):
@NGC13009 Thanks for sharing your experience.
@fanlessfan commented on GitHub (Apr 4, 2025):
Hi @rick-github,
I tried disabling NUMA and the performance is worse, though not by much. My memory speed dropped to 150GB/s rather than ~100GB/s. What model did you run to get 3 t/s? With 384GB of memory you can't run the deepseek-r1:671b model.
I tried the 20GB qwen model and got 2.5 t/s with NUMA disabled, and 2.8 t/s without disabling NUMA.
@mrdg-sys commented on GitHub (Apr 4, 2025):
When you run ollama, leave the inference settings at their defaults; don't force or specify a thread count. By default ollama runs one thread per CPU core, and modifying this will decrease performance.
Due to my RAM limitation, my 671B model of choice is the 2.51-bit quant, only 220GB.
@mrdg-sys commented on GitHub (Apr 4, 2025):
"I tried the 20GB qwen model and got 2.5 t/s with NUMA disabled, and 2.8 t/s without disabling NUMA."
Yes, that's normal: a small model that fits within a single node's DIMMs benefits from NUMA being enabled, but large models that span many DIMMs do better with NUMA disabled.
NUMA = Non-Uniform Memory Access (enabled)
UMA = Uniform Memory Access (disabled)
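One software-side knob worth trying on a Linux dual-socket box (my own suggestion, not something tested in this thread) is interleaving allocations across both NUMA nodes with numactl, which approximates UMA without touching the BIOS:

```shell
# Requires the numactl package. Stop the service first so the manually
# started server inherits the interleave policy:
sudo systemctl stop ollama
numactl --interleave=all ollama serve
```

This spreads model pages round-robin across both sockets' memory, trading local-node latency for balanced bandwidth, the same trade the BIOS UMA setting makes.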
@fanlessfan commented on GitHub (Apr 4, 2025):
For 671b-q4 I got 1.95 t/s with NUMA disabled and 2.24 t/s with it enabled. I also monitored real-time memory speed with Intel PCM: it was around 80GB/s with NUMA enabled and dropped to 40GB/s with NUMA disabled.
Did you make the 2.51 quant yourself? ollama only provides the 671b (404GB) model.
Yes, I leave the ollama settings at their defaults. I did try different thread counts on my i7-13700K (8 P-cores, 8 E-cores, 24 threads) and found that 16 threads performs better than the default. I didn't try the 671b model on it as it only has 96GB of RAM. Is there any way to verify the thread count while ollama is running?
@mrdg-sys commented on GitHub (Apr 4, 2025):
Here is the download link for the 2.51 quant on ollama:
https://ollama.com/Huzderu/deepseek-r1-671b-2.51bit
or simply run from command prompt:
ollama pull Huzderu/deepseek-r1-671b-2.51bit
@fanlessfan commented on GitHub (Apr 4, 2025):
Thank you @mrdg-sys
@fanlessfan commented on GitHub (Apr 4, 2025):
I tried the 671b-2.5bit model and got the result below. It's still faster with NUMA enabled, and it never reaches 3 t/s. Maybe the extra memory doesn't help here, or the motherboard makes a difference.
@mrdg-sys commented on GitHub (Apr 4, 2025):
Is your memory running at 2666MHz?
@mrdg-sys commented on GitHub (Apr 5, 2025):
@fanlessfan
Here are my BIOS settings and the result from the 2.51 quant with the prompt:
why is the sky blue?
Perhaps it's time you consider Windows 11 as the best inference platform for LLMs!
@fanlessfan commented on GitHub (Apr 5, 2025):
Hi @mrdg-sys,
My memory is 2666MHz.
I tried Windows 11 and NUMA behaves differently than on Linux: disabling NUMA makes ollama faster, but not as fast as Linux with NUMA enabled.
I think it's hard to compare, as there are so many factors: ollama on different platforms, the OS, the motherboard, even memory capacity might affect this.
Thank you so much for spending time with me on this.
on Windows 11
NUMA enabled
total duration: 19m6.6652296s
load duration: 6m37.2427768s
prompt eval count: 9 token(s)
prompt eval duration: 31.786227s
prompt eval rate: 0.28 tokens/s
eval count: 1054 token(s)
eval duration: 11m57.6294909s
eval rate: 1.47 tokens/s
NUMA disabled
total duration: 12m17.4283877s
load duration: 6m5.0180974s
prompt eval count: 9 token(s)
prompt eval duration: 2.10005s
prompt eval rate: 4.29 tokens/s
eval count: 843 token(s)
eval duration: 6m10.3025057s
eval rate: 2.28 tokens/s
on Linux
No NUMA
prompt eval rate: 20.156 tokens/s
eval rate: 2.147 tokens/s
NUMA enabled
prompt eval rate: 20.441 tokens/s
eval rate: 2.419 tokens/s
@fanlessfan commented on GitHub (Apr 5, 2025):
By the way, below is my memory model; your memory might be faster than mine. You can use the link below to check your memory; it includes a compiled Windows version.
@mrdg-sys commented on GitHub (Apr 5, 2025):
Ollama does have some NUMA support on Linux, which explains your inference boost with NUMA enabled. On Windows the situation is the opposite, because there's no support at the moment. However, my Windows inference results are better than my Linux ones.
Perhaps it all comes down to our hardware...
@mrdg-sys commented on GitHub (Apr 5, 2025):
I'll give it a try next week when I'm back in the office.