Mirror of https://github.com/ollama/ollama.git (synced 2026-05-06 08:02:14 -05:00)
Closed · opened 2026-04-22 02:57:58 -05:00 by GiteaMirror · 41 comments
Originally created by @djmaze on GitHub (Dec 15, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1556
It seems that as the context grows, the delay until the first output gets longer and longer, taking more than half a minute after a few prompts. Also, text generation seems much slower than with the latest llama.cpp (command line).
Using CUDA on an RTX 3090. Tried out mixtral:8x7b-instruct-v0.1-q4_K_M (with CPU offloading) as well as mixtral:8x7b-instruct-v0.1-q2_K (completely in VRAM).
As a comparison, I tried starling-lm:7b-alpha-q4_K_M, which does not seem to exhibit any of these problems.
Sorry for the imprecise report, running out of time right now. Does anyone have a similar experience with Mixtral? Or is this expected behaviour with ollama? (First-time user here.)
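For anyone reproducing the timings discussed below: ollama prints per-response statistics (load duration, prompt eval rate, eval rate) when run with the --verbose flag. An illustrative invocation for the model above (standard ollama CLI usage, not something specified in this report) would be:
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M --verbose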
@madsamjp commented on GitHub (Dec 16, 2023):
Can confirm I'm also having this issue. I'm running dolphin-mixtral:8x7b-v2.5-q5_K_M with 22 layers offloaded to GPU (RTX 4090). First response takes 2 secs, second response 26 secs, 3rd 37 secs and 4th 49 secs. By the 4th response there are 888 tokens in the context window.
Eval rate is a respectable ~10tps, but with a > 1 minute prompt eval by the 5th response, it's unusable.
@easp commented on GitHub (Dec 17, 2023):
Yeah, big issue on Apple Silicon Macs, too. I've seen references to this being a known problem for mixtral on llama.cpp right now, but I can't find an actual issue about it on the llama.cpp GitHub.
@phalexo commented on GitHub (Dec 17, 2023):
Ollama has a history file in the ~/.ollama folder. Does ollama constantly parse that cache?
@easp commented on GitHub (Dec 17, 2023):
That's just the readline history. It's just commands entered in the REPL.
@easp commented on GitHub (Dec 17, 2023):
Looks like this recently merged llama.cpp PR may improve prompt-processing speed with Mixtral: https://github.com/ggerganov/llama.cpp/pull/4480
@coder543 commented on GitHub (Dec 17, 2023):
The default mixtral Modelfile only offloads like 22 layers, as noted previously. For people with 24GB of VRAM, I have found that the q3_K_S model can be completely offloaded to the GPU, which speeds things up dramatically.
Make a Modelfile, then run: ollama create mixtral_gpu -f ./Modelfile
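A rough sketch of such a Modelfile (the model tag and layer count below are illustrative assumptions, not quoted from the original comment):
FROM mixtral:8x7b-instruct-v0.1-q3_K_S
PARAMETER num_gpu 33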
Then you can run ollama run mixtral_gpu and see how it does.
@coder543 commented on GitHub (Dec 17, 2023):
I also wonder if it would be possible for ollama to keep the eval state between prompts, rather than re-processing the entire context window for each new message. I understand ollama is trying to run a model server so there could be requests coming from more than one session at a time, but maybe it's possible to only clear the state and start from scratch if a request from a different session is received? This is all a little beyond my expertise, so I could be completely wrong.
@phalexo commented on GitHub (Dec 17, 2023):
Using llama.cpp directly in interactive mode does not appear to have any major delays. It takes merely a second or two to start answering even after a relatively long conversation.
Looks like latency is specific to ollama.
@djmaze commented on GitHub (Dec 17, 2023):
@coder543 As stated in my initial post, I even tried the q2_K version, loading all 33 layers into the GPU. Still, the token generation is quite slow and the delay before the token generation starts increases on every prompt as the context grows.
As also stated, when using llama.cpp or a totally different model, there are no delays and the token generation (for the same model) is significantly faster.
@coder543 commented on GitHub (Dec 17, 2023):
@djmaze that is strange, since I'm not encountering any unusual problems on my 3090.
Here, there are nearly 1200 tokens in the context window of previous chat messages, and yet it is able to generate a response in less than 20 seconds. Yes, this is slower than it could be, but that seems to relate to what I mentioned in my previous comment about it not keeping the eval state between generations.
This is not the terrible performance that other people are describing, where it is taking 50 seconds with less than 900 tokens in the context window.
EDIT: testing mistral (instead of mixtral), I am seeing this after a similar situation:
The key differentiator is that the prompt eval rate is obviously way higher. As someone else linked to a PR which improved prompt eval rate on the CPU, it isn't crazy to assume that the prompt eval rate on the GPU needs some improvements as well. You say llama.cpp is much faster at this, but I haven't actually observed any real difference. Doing more testing now.
EDIT 2: yes, using the llama.cpp server, it appears to be doing exactly what I mentioned: keeping the eval state in memory. It is processing prompt tokens at the same rate as ollama, it is just processing fewer of them because it does not appear to be re-evaluating the entire context window with each new prompt. The other ollama models suffer the same problems, they just seem to have a much higher prompt eval rate than mixtral, which helps to mask it.
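For reference (this is general llama.cpp server usage, not a detail given in the thread): the server reuses the already-evaluated prompt when a request asks for it via the cache_prompt flag, along the lines of:
curl http://localhost:8080/completion -d '{"prompt": "...", "n_predict": 128, "cache_prompt": true}'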
@kaykyr commented on GitHub (Dec 17, 2023):
I can confirm the same issue here, even using both a 3090 and a 4090.
@djmaze commented on GitHub (Dec 17, 2023):
I just tried out nous-hermes:70b-llama2-q2_K in order to have a bigger model for comparison. With 51 of 81 layers offloaded to GPU, the token generation is quite slow, as expected. But I do not experience the initial delay, even when the context grows.
I also tried dolphin-mixtral:8x7b-v2.5-q4_K_M (a Mixtral finetune). It causes the same delays as I've seen with mixtral:8x7b-instruct.
From this I deduce that (at least for me) the problem is specific to the Mixtral models.
@coder543 commented on GitHub (Dec 17, 2023):
@djmaze please post the verbose output. Does it not show that the number of prompt eval tokens is growing? Presumably, it just has a much more optimized prompt eval rate, as with the mistral output I showed, but it should still have the same fundamental issue that it does not cache the eval state.
@djmaze commented on GitHub (Dec 18, 2023):
@coder543 (Sorry, I was testing with the webui before, so I didn't have any values.) After I found out how to do it, I tested the prompt eval rate of several models with ollama now (approximate values):
It seems interesting to me that although nous-hermes:70b-llama2-q2_K has a similar number of layers offloaded to the GPU and a much slower eval rate, it still shows a much higher prompt eval rate than mixtral:8x7b-instruct-v0.1-q4_K_M.
TL;DR: You seem to be right. The mixtral prompt eval rate, at least when only partially offloaded, looks abysmal. I wonder if that is because of the MoE architecture. Or does it also depend on the quantization?
@ghost commented on GitHub (Dec 19, 2023):
I installed Ollama with the curl ... | sh command on WSL, and am running dolphin-mixtral:latest on 64G RAM and a 4080 with 16G VRAM. I don't really understand anything about running this stuff, but yeah, the more I talk to the AI, the longer every reply gets delayed. Is it something that can be fixed at the software level, or something I can do on my end?
@djmaze commented on GitHub (Dec 19, 2023):
Either way, I support @coder543's wish for a prompt eval cache. There is already an issue at #1573 for that, maybe we can continue there.
@jamesbascle commented on GitHub (Dec 19, 2023):
I used
to try getting as much of it onto my 3090 as possible and got a bit of a speedup but it is still pretty slow and only gets slower as the conversation goes on.
@phalexo commented on GitHub (Dec 20, 2023):
I was wondering if there is any indication that someone is looking into this? Also, I am wondering what effect the LLAMA_CUDA_FORCE_MMQ=on setting has on performance. If the optimized cuBLAS kernels are not used, then what is the performance penalty when using MMQ kernels instead?
And why was ollama 0.1.11 and earlier working? Presumably it was using cuBLAS. What changed from 0.1.11 to 0.1.12 to make it stop working?
@gnusenpai commented on GitHub (Dec 21, 2023):
Building ollama with https://github.com/ggerganov/llama.cpp/pull/4538 and (optionally, if you do CPU+GPU inference) https://github.com/ggerganov/llama.cpp/pull/4553 has made prompt eval significantly faster for me. (~60t/s vs. ~10t/s)
@coder543 commented on GitHub (Dec 21, 2023):
For me, using llama.cpp directly, that PR appears to have raised prompt eval rate to about 325t/s:
print_timings: prompt eval time = 3444.74 ms / 1122 tokens ( 3.07 ms per token, 325.71 tokens per second)
print_timings: eval time = 5166.55 ms / 205 runs ( 25.20 ms per token, 39.68 tokens per second)
print_timings: total time = 8611.28 ms
Still not as fast as other models, but a significant improvement
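For anyone trying to reproduce that comparison: the print_timings lines above are in the format of the llama.cpp example server's log. A launch command in the spirit of that test would look roughly like the following, where the model path, offloaded layer count, and context size are assumptions rather than values from the thread:
./server -m ./mixtral-8x7b-instruct-v0.1.Q3_K_S.gguf -ngl 33 -c 4096 --port 8080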
@djmaze commented on GitHub (Dec 21, 2023):
But beware, it seems the quality might have dropped as a side effect: https://github.com/ggerganov/llama.cpp/issues/4572
@Confuze commented on GitHub (Dec 22, 2023):
I think I'm running into the same issue on v0.0.17 (installed from the ollama-cuda package on Arch Linux).
When running dolphin-mixtral with num_gpu set to 10000 just to be sure, it's practically unusable: it takes the model about a minute to start responding to a single prompt in the first place, and it generates the answer in a painfully slow manner. (I tried it without the num_gpu parameter as well, no difference.) According to nvidia-smi, ollama isn't using the GPU (RTX 2070) at all.
This appears to be a problem related only to mixtral, as running others like llama2 results in perfect performance and my GPU being fully used.
By reading through this issue I understand there's not much we can do on a user's level, right? Apologies if this comment makes no sense, I know nothing about this thing just wanted to generate the recipe for meth.
@coder543 commented on GitHub (Dec 22, 2023):
@Confuze you don’t have enough VRAM to run Mixtral entirely on the GPU. ollama will be trying to load the model onto the GPU, running out of memory, and then falling back to just running on the CPU.
@phalexo commented on GitHub (Dec 22, 2023):
I don't think we should confuse two separate problems.
Sometimes there is really not enough VRAM.
Sometimes you run into the cuBLAS 15 error, which was introduced starting with v0.1.12 and which also often looks like an OOM. v0.1.11 didn't have this issue.
The only way to mitigate it, that I am aware of at the moment, is to build with LLAMA_CUDA_FORCE_MMQ=on, but this solution, as far as I know, is slower than cuBLAS. It really should be fixed.
@Confuze commented on GitHub (Dec 22, 2023):
I see, after looking at the logs, it seems like you are right.
So, the only option I have is running this model on a cpu? (besides getting a better gpu of course) There's no way to load it partially with the gpu and partially with the cpu?
@coder543 commented on GitHub (Dec 22, 2023):
@Confuze the num_gpu parameter that you set to 10000 was trying to force more layers onto the GPU. Mixtral has 33 layers. You just have to keep lowering that number until the VRAM usage is low enough. I would be surprised if you can fit more than 10 layers on the 8GB of VRAM that I think your GPU has. (You have to call ollama create then start a new ollama run session after each change to the Modelfile, or else the changes won't apply.) Then it will use both the CPU and the GPU. Unfortunately, offloading only a small number of layers of any model doesn't seem to give much more speed than just using the CPU, but you can try it out and see how well it works for you.
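A minimal sketch of that iterate-on-num_gpu loop (the model tag, target name, and starting layer count here are illustrative, not recommendations from this comment):
# Modelfile
FROM dolphin-mixtral:8x7b-v2.5-q4_K_M
PARAMETER num_gpu 10
# then rebuild and start a fresh session after every change:
ollama create dolphin-mixtral-gpu -f ./Modelfile
ollama run dolphin-mixtral-gpu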
@madsamjp commented on GitHub (Dec 23, 2023):
@confuze, I've successfully managed to run this model using text-generation-webui with llama.cpp.
I offload 20 layers to my 4090 with a context window of 8k. I get a consistent 8-10 tps each time. This slowness issue is definitely an issue with Ollama.
@coder543 commented on GitHub (Dec 23, 2023):
@madsamjp With a 4090, you should be able to offload all 33 layers of the 3-bit quantized models and get 50+ tokens per second. If you want to run the 5-bit model, it will be slow because CPU inference of any LLM is dependent on the memory bandwidth, and outside of Apple Silicon, CPUs do not have very much memory bandwidth compared to GPUs.
I’m not connected to the ollama project, but I don’t see how this is ollama’s fault in the slightest.
Unless you’re talking about the prompt eval time issue, which was already discussed at length and is clearly a choice ollama has made not to cache the eval state between prompts. In which case, I don’t see anything new in your comment. @Confuze did not seem to be talking about the prompt eval issue at all. They were encountering slowness on the very first prompt, not subsequent prompts where the context was growing.
@madsamjp commented on GitHub (Dec 23, 2023):
@coder543 I understand that running the 5 bit model will be slow on a 4090 compared to running the 3 bit. My comment was specifically in response to this point that @confuze made: "So, the only option I have is running this model on a cpu? ". I've found that running this model using llama.cpp (with ooba), and partially offloading to gpu seems to work fine compared to Ollama, where it doesn't work without very long (and progressively worse) prompt eval times. Using Ollama, after 4 prompts, I'm waiting about 1 minute before I start to get a response. The response timing for me is not slow - about 10 tps.
My understanding of this thread was that Ollama seems to have progressively longer prompt eval times - even for models that fit entirely in VRAM. If this is because of a conscious decision that the Ollama team has made, then it makes running Mixtral using Ollama unfeasible.
It seems that perhaps we are discussing separate issues in the same thread which is leading to confusion.
@iTestAndroid commented on GitHub (Dec 29, 2023):
This is my nvidia-smi output
It's responding very slowly for me. Some prompts take 15 seconds. Any suggestions? I did the Modelfile trick with num_gpu set to 1000 but it's still doing 93 as I can see when I run ps -aux.
My server does have 768GB RAM and 2x Xeon CPUs; still, it's disappointingly slow to run mixtral.
@coder543 commented on GitHub (Jan 6, 2024):
FWIW, I saw that the release notes for ollama 0.1.18 mentioned "Improved performance when sending follow up messages in ollama run or via the API."
I just tested it, and it appears that ollama is now caching the eval state between prompts.
The first prompt:
The second prompt:
And, for good measure, sending a third prompt to the same chat:
In the second and third prompts, it still evaluated very few tokens for the prompt. In previous versions, it would evaluate the entire context window again with each message.
So, one of the two problems being discussed here appears to be resolved. The other issue (prompt eval rate being low for Mixtral) is still relatively unsolved.
@djmaze commented on GitHub (Jan 13, 2024):
As the latest Ollama versions are crashing or not even starting for me, I was looking for an alternative solution, at least until the current problems are solved. For people experiencing similar problems, I can warmly recommend using ExllamaV2 or, more concretely, TabbyAPI, which uses ExllamaV2 as its backend.
A 3.5 bpw quant of mixtral with 4k context easily fits into a 24 GB card, even leaving a few GB for other stuff. With a 3090, I am seeing consistent eval rates of 70 tps, which is much more than I was able to achieve with Ollama / llama.cpp.
It might even be interesting to add ExllamaV2 as a backend for Ollama?
@Bearsaerker commented on GitHub (Jan 22, 2024):
This problem does not only exist with the 8x7b Mixtral version. All MoEs I tested had the initial big delay, while other models were instant. I used the fusion 2x7b q4km and the solar q5km. The Solar output was instant, while the fusion 2x7b gradually increased its delay as the context grew.
@Bearsaerker commented on GitHub (Jan 23, 2024):
With the newest pre-release of ollama 0.1.21 it seems fixed. I'm sure it had something to do with llama.cpp, which was updated in this release.
@pdevine commented on GitHub (Mar 12, 2024):
This should be fixed now. With a 4090, I see:
Then:
And for the 3rd prompt:
Going to go ahead and close the issue.
@grafke commented on GitHub (Mar 13, 2024):
I just pulled the latest docker image ollama/ollama:0.1.29 and I'm still experiencing very long prompt eval times (with large prompts). @pdevine do you know if the fix for it is in the image? Or shall I build the ollama from the latest main?
Here are my results, the first prompt:
the second prompt
and the third prompt:
Generation is fast but the prompt eval time is suuuuper slow.
I'm using the option: "num_ctx": 32768
And running this model: https://ollama.com/grf/mixtral_wa_q4_cp (it's a quantized mixtral with an adapter) on an A100-40GB.
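For readers unfamiliar with where that num_ctx option goes (this is standard ollama usage, not something spelled out in the comment): it can be passed per request in the options object, e.g.
curl http://localhost:11434/api/generate -d '{"model": "grf/mixtral_wa_q4_cp", "prompt": "...", "options": {"num_ctx": 32768}}'
or set once in a Modelfile with PARAMETER num_ctx 32768.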
@pdevine commented on GitHub (Mar 13, 2024):
@grafke back-of-the-napkin math for mixtral at a 4-bit quantization is that it needs about 30ish GB, but I'm not 100% sure how the context length impacts the total amount of memory required (i.e. if you're swapping) or if it's just that the long context length requires that much more computation power.
My understanding is that the memory/computational resources scale quadratically as you increase the context size, so you're going to need quite a bit more memory than the 40GB. FWIW I pulled your model on my M3 128GB machine and got:
I think that's roughly tracking the speeds you're seeing?
@grafke commented on GitHub (Mar 14, 2024):
@pdevine Thanks for taking a look into it! I will try to get a A100 80GB to see if this could be resolved by increasing the memory. Indeed you're right, I'm seeing similar results.
I tested the (slow) transformers library and got a TTFT (time to first token) of ~1.5 seconds (prompts are between 2k and 3k tokens long), whilst on ollama the TTFT is ~8-12 seconds with the same prompts (however, the total response time is 2-3x faster on ollama).
So that got me thinking whether there is something I'm missing that makes the "slow" library have a shorter TTFT.
@pdevine commented on GitHub (Mar 14, 2024):
Interesting. Are you including model loading time in the TTFT? On my system that's about 2 seconds, although I'm not including model load time.
@pdevine commented on GitHub (Mar 14, 2024):
@grafke just thinking about that some more, you can make a call like:
Which will preload the model in memory so that when you make the next call it should be faster.
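As a hedged illustration of such a preload call (standard ollama API behaviour, not necessarily the exact command from the original comment): sending a generate request with no prompt loads the model into memory without producing a completion, e.g.
curl http://localhost:11434/api/generate -d '{"model": "grf/mixtral_wa_q4_cp"}'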
@grafke commented on GitHub (Mar 18, 2024):
@pdevine The model is indeed preloaded in memory after I make the first request (and it is slow). The requests I posted above are subsequent requests, when the model is already in memory. I guess this is due to the very long prompts (up to 2k-3k tokens). Shorter prompts indeed have a much shorter TTFT.