[GH-ISSUE #5862] Context Window Size Issue with Mistral Nemo Model on Ollama Version 0.2.8-rc2 (Apple Mac Silicon M2 Pro) #3654
Closed · 14 comments
Originally created by @MrSimonC on GitHub (Jul 22, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5862
What is the issue?
Hey amazing team! I'm experiencing an issue with the context window size when using the new Mistral Nemo model on Ollama version 0.2.8-rc2 on my Apple Silicon M2 Pro Mac. According to the documentation, the context window should be approximately 128,000 tokens. However, when I run ollama show for the Mistral Nemo model, the reported context length is 1.024e+06 = 1,024,000 (about 1 million) tokens, which is significantly larger than the expected 128K.
Additionally, I've noticed that my own "needle in a haystack" test consistently fails when using the Mistral Nemo model on Ollama, whereas the same test passes every time with GPT-4o and the expected 128,000-token context window. I've even taken the model temperature down to 0.3, and then to 0.1 as recommended on the Hugging Face model page - but no difference. This suggests to me that there may be an issue with the model or its integration with Ollama.
Can you help diagnose the issue? Are my observations correct? Are there any other logs or information I can provide to help troubleshoot this problem?
I'm using the latest 0.2.8-rc2 on Mac.
Let me know if you’d like me to add anything else!
OS
macOS
GPU
Apple
CPU
Apple
Ollama version
0.2.8-rc2
@rick-github commented on GitHub (Jul 22, 2024):
Server logs may help in diagnosis.
@igorschlum commented on GitHub (Jul 22, 2024):
Same problem with Ollama 0.2.7 and a Mac Studio M1
and no logs:
(base) igor@mac-studio ~ % cd ~/.ollama/
(base) igor@mac-studio .ollama % ls
history id_ed25519 id_ed25519.pub logs models
(base) igor@mac-studio .ollama % rm -rf logs
(base) igor@mac-studio .ollama % cat ~/.ollama/logs/server.log
cat: /Users/igor/.ollama/logs/server.log: No such file or directory
(base) igor@mac-studio .ollama % ollama show mistral-nemo
Model
arch llama
parameters 12.2B
quantization Q4_0
context length 1.024e+06
embedding length 5120
Parameters
stop "[INST]"
stop "[/INST]"
License
" Apache License
Version 2.0, January 2004
(base) igor@mac-studio .ollama % cat ~/.ollama/logs/server.log
cat: /Users/igor/.ollama/logs/server.log: No such file or directory
(base) igor@mac-studio .ollama %
@AnShengqiang commented on GitHub (Jul 23, 2024):
same
@Backroads4Me commented on GitHub (Jul 23, 2024):
Same here on a Linux server (the model is working quite well though):
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: loaded meta data with 35 key-value pairs and 363 tensors from /llm_models/Ollama/blobs/sha256-824229be17606dd8177fc91c1d330b065bc4f3de2873eab614376b988dcbf48a (version GGUF V3 (latest))
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 0: general.architecture str = llama
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 1: general.type str = model
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 2: general.name str = Mistral Nemo Instruct 2407
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 3: general.version str = 2407
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 4: general.finetune str = Instruct
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 5: general.basename str = Mistral-Nemo
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 6: general.size_label str = 12B
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 7: general.license str = apache-2.0
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 8: general.languages arr[str,9] = ["en", "fr", "de", "es", "it", "pt", ...
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 9: llama.block_count u32 = 40
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 10: llama.context_length u32 = 1024000
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 11: llama.embedding_length u32 = 5120
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 1000000.000000
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 17: llama.attention.key_length u32 = 128
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 18: llama.attention.value_length u32 = 128
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 19: general.file_type u32 = 7
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 20: llama.vocab_size u32 = 131072
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 21: llama.rope.dimension_count u32 = 128
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 22: tokenizer.ggml.add_space_prefix bool = false
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 24: tokenizer.ggml.pre str = tekken
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Jul 22 21:44:16 Ollama ollama[25542]: [132B blob data]
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 1
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 2
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 0
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = true
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 32: tokenizer.ggml.add_eos_token bool = false
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - kv 34: general.quantization_version u32 = 2
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - type f32: 81 tensors
Jul 22 21:44:16 Ollama ollama[25542]: llama_model_loader: - type q8_0: 282 tensors
Jul 22 21:44:16 Ollama ollama[25542]: time=2024-07-22T21:44:16.996-04:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server loading model"
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_vocab: special tokens cache size = 1000
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_vocab: token to piece cache size = 0.8498 MB
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: format = GGUF V3 (latest)
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: arch = llama
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: vocab type = BPE
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_vocab = 131072
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_merges = 269443
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: vocab_only = 0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_ctx_train = 1024000
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_embd = 5120
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_layer = 40
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_head = 32
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_head_kv = 8
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_rot = 128
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_swa = 0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_embd_head_k = 128
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_embd_head_v = 128
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_gqa = 4
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_embd_k_gqa = 1024
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_embd_v_gqa = 1024
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: f_norm_eps = 0.0e+00
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: f_logit_scale = 0.0e+00
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_ff = 14336
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_expert = 0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_expert_used = 0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: causal attn = 1
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: pooling type = 0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: rope type = 0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: rope scaling = linear
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: freq_base_train = 1000000.0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: freq_scale_train = 1
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: n_ctx_orig_yarn = 1024000
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: rope_finetuned = unknown
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: ssm_d_conv = 0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: ssm_d_inner = 0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: ssm_d_state = 0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: ssm_dt_rank = 0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: model type = 13B
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: model ftype = Q8_0
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: model params = 12.25 B
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: model size = 12.12 GiB (8.50 BPW)
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: general.name = Mistral Nemo Instruct 2407
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: BOS token = 1 '<s>'
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: EOS token = 2 '</s>'
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: UNK token = 0 '<unk>'
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: LF token = 1196 'Ä'
Jul 22 21:44:17 Ollama ollama[25542]: llm_load_print_meta: max token length = 150
@LeoX91 commented on GitHub (Jul 23, 2024):
+1 for this issue
@rick-github commented on GitHub (Jul 23, 2024):
The context window size comes from the model source; see max_position_embeddings in config.json. This is the parameter that is used if n_ctx is not set. It's not clear to me why this is a problem - the context window defines the largest set of tokens that the model can ingest, but the client is not required to send that many tokens. If the query is limited to less than 128k tokens it will work just as well. If the disparity is still a concern, "n_ctx": 128000, can be added to config.json and the model re-quantized locally.
With respect to the needle test, can you provide more info? I couldn't find a copy of LLMTest_NeedleInAHaystack that works with ollama (due to the tokenization issue that I don't have time to work around), so I used the poor man's copy from haystack-test. This scored 100% at context sizes of 8192 and 32000, and 80% at 64000. I couldn't test 128000 because my machine doesn't have the resources. So in limited testing it appears the model is not great at NIAH for context windows > 32k - is that what you are seeing?
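As a quick way to see the value being discussed here, the context length a model's GGUF metadata declares can be read back from Ollama's /api/show endpoint. Below is a minimal sketch in Python; it assumes a recent Ollama server whose /api/show response includes a model_info block, and the exact key prefix varies by architecture:

```python
import json
from urllib import request

def reported_context_length(model: str, host: str = "http://localhost:11434"):
    """Return the context length the model's GGUF metadata declares, or None."""
    req = request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": model}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        info = json.load(resp).get("model_info", {})
    # The key is prefixed with the architecture, e.g. "llama.context_length".
    for key, value in info.items():
        if key.endswith(".context_length"):
            return value
    return None

print(reported_context_length("mistral-nemo"))  # prints 1024000 for the affected model
```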
@MrSimonC commented on GitHub (Jul 23, 2024):
So there's definitely something odd going on (at least for me). I've taken the liberty of repeating the same test you ran, but with only 10 trials at an 8K context.
I've included the results here and also the server logs during the time of execution of the tests.
Test result: Score: 7/10, 70.00%
Test Results
si@Simons-Mac-mini haystack-test % python3 haystack-multi.py -m mistral-nemo -f text.txt -s secrets.txt -c 8192 -t 100
Testing mistral-nemo
Secret: The silvery moon cast a glowing path across the dark sea.
Inserted 2: "cast a glowing" at 955, 3: "path across the dark sea." at 980, 1: "The silvery moon" at 2188
Total: 58.28 secs, Load: 4.48 secs, Prompt Processing: 7528 tokens, 153.49 tk/s, Text Generation: 56 tokens, 11.87 tk/s
Response: The secret fragments are:
Arranged in numerical order, the complete secret sentence is: "The silvery moon casts a glowing path across the dark sea."
Failed test 1/100
Score: 0/1, 0.00%
Secret: The busy beavers worked tirelessly to build their dam.
Inserted 2: "worked tirelessly to" at 317, 3: "build their dam." at 2196, 1: "The busy beavers" at 3588
Total: 51.79 secs, Load: 0.02 secs, Prompt Processing: 7527 tokens, 154.32 tk/s, Text Generation: 35 tokens, 11.83 tk/s
Response: The complete secret sentence using the numbered fragments is:
1: "The busy beavers"
2: "worked tirelessly to"
3: "build their dam."
Failed test 2/100
Score: 0/2, 0.00%
Secret: The mechanic's tools clanged as he worked under the hood of the classic car.
Inserted 2: "as he worked under" at 840, 3: "the hood of the classic car." at 2215, 1: "The mechanic's tools clanged" at 2496
Total: 50.41 secs, Load: 0.02 secs, Prompt Processing: 7530 tokens, 154.11 tk/s, Text Generation: 18 tokens, 12.08 tk/s
Response: The mechanic's tools clanged as he worked under the hood of the classic car.
Passed test 3/100
Score: 1/3, 33.33%
Secret: The sculptor's chisel peeled away marble to reveal the form within.
Inserted 1: "The sculptor's chisel" at 2078, 3: "to reveal the form within." at 3925, 2: "peeled away marble" at 3941
Total: 51.56 secs, Load: 0.02 secs, Prompt Processing: 7529 tokens, 154.19 tk/s, Text Generation: 32 tokens, 11.98 tk/s
Response: The secret sentence assembled from the numbered fragments hidden in the text is:
"To reveal the form within, peel away marble with a sculptor's chisel."
Failed test 4/100
Score: 1/4, 25.00%
Secret: The stars twinkled brightly in the dark night sky.
Inserted 3: "dark night sky." at 323, 2: "brightly in the" at 901, 1: "The stars twinkled" at 3164
Total: 49.82 secs, Load: 0.02 secs, Prompt Processing: 7526 tokens, 154.44 tk/s, Text Generation: 13 tokens, 12.60 tk/s
Response: The stars twinkled brightly in the dark night sky.
Passed test 5/100
Score: 2/5, 40.00%
Secret: The inventor unveiled his latest creation.
Inserted 3: "latest creation." at 523, 2: "unveiled his" at 3389, 1: "The inventor" at 4257
Total: 49.96 secs, Load: 0.01 secs, Prompt Processing: 7523 tokens, 155.24 tk/s, Text Generation: 18 tokens, 12.37 tk/s
Response: The secret sentence assembled from the numbered fragments is:
"The inventor unveiled his latest creation."
Passed test 6/100
Score: 3/6, 50.00%
Secret: The photographer captured a fleeting moment, immortalizing it forever.
Inserted 2: "a fleeting moment," at 401, 3: "immortalizing it forever." at 2552, 1: "The photographer captured" at 2602
Total: 53.48 secs, Load: 0.01 secs, Prompt Processing: 7528 tokens, 154.54 tk/s, Text Generation: 56 tokens, 11.87 tk/s
Response: The secret fragments are:
1: "The photographer captured"
2: "a fleeting moment,"
3: "immortalizing it forever."
Arranged in numerical order, the complete secret sentence is:
"The photographer captured a fleeting moment, immortalizing it forever."
Passed test 7/100
Score: 4/7, 57.14%
Secret: The roaring fire warmed the cold room.
Inserted 1: "The roaring" at 484, 3: "the cold room." at 839, 2: "fire warmed" at 3228
Total: 52.63 secs, Load: 0.01 secs, Prompt Processing: 7522 tokens, 154.97 tk/s, Text Generation: 48 tokens, 11.86 tk/s
Response: The secret fragments are:
1: "The roaring"
2: "fire warmed"
3: "the cold room."
Arranged in numerical order, the complete secret sentence is: "The roaring fire warmed the cold room."
Passed test 8/100
Score: 5/8, 62.50%
Secret: The clock ticked away, marking the passing of another day.
Inserted 2: "away, marking the" at 591, 1: "The clock ticked" at 1372, 3: "passing of another day." at 1434
Total: 53.66 secs, Load: 0.01 secs, Prompt Processing: 7528 tokens, 154.26 tk/s, Text Generation: 57 tokens, 11.82 tk/s
Response: The secret fragments are:
Arranged in numerical order, the complete secret sentence is: "The clock ticked away, marking the passing of another day."
Passed test 9/100
Score: 6/9, 66.67%
Secret: The ancient clock chimed solemnly in the heart of the old town.
Inserted 2: "solemnly in the heart" at 1028, 3: "of the old town." at 4175, 1: "The ancient clock chimed" at 4327
Total: 50.09 secs, Load: 0.01 secs, Prompt Processing: 7531 tokens, 154.47 tk/s, Text Generation: 16 tokens, 12.39 tk/s
Response: The ancient clock chimed solemnly in the heart of the old town.
Passed test 10/100
Score: 7/10, 70.00%
Server logs during execution
[GIN] 2024/07/23 - 21:18:11 | 200 | 58.792µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/07/23 - 21:18:11 | 200 | 11.793917ms | 127.0.0.1 | GET "/api/tags"
time=2024-07-23T21:18:56.313+01:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/si/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 gpu=0 parallel=4 available=22906503168 required="14.1 GiB"
time=2024-07-23T21:18:56.313+01:00 level=INFO source=memory.go:309 msg="offload to metal" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[21.3 GiB]" memory.required.full="14.1 GiB" memory.required.partial="14.1 GiB" memory.required.kv="5.0 GiB" memory.required.allocations="[14.1 GiB]" memory.weights.total="10.7 GiB" memory.weights.repeating="10.2 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="2.1 GiB" memory.graph.partial="2.1 GiB"
time=2024-07-23T21:18:56.314+01:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/var/folders/2b/z8p4dwhn3v5544mf9jn3gkrr0000gn/T/ollama3452655135/runners/metal/ollama_llama_server --model /Users/si/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --mlock --parallel 4 --port 56532"
time=2024-07-23T21:18:56.316+01:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-23T21:18:56.316+01:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
time=2024-07-23T21:18:56.316+01:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3440 commit="d94c6e0c" tid="0x2053a8c00" timestamp=1721765936
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x2053a8c00" timestamp=1721765936 total_threads=10
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="9" port="56532" tid="0x2053a8c00" timestamp=1721765936
llama_model_loader: loaded meta data with 35 key-value pairs and 363 tensors from /Users/si/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Mistral Nemo Instruct 2407
llama_model_loader: - kv 3: general.version str = 2407
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Mistral-Nemo
llama_model_loader: - kv 6: general.size_label str = 12B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.languages arr[str,9] = ["en", "fr", "de", "es", "it", "pt", ...
llama_model_loader: - kv 9: llama.block_count u32 = 40
llama_model_loader: - kv 10: llama.context_length u32 = 1024000
llama_model_loader: - kv 11: llama.embedding_length u32 = 5120
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: llama.attention.key_length u32 = 128
llama_model_loader: - kv 18: llama.attention.value_length u32 = 128
llama_model_loader: - kv 19: general.file_type u32 = 2
llama_model_loader: - kv 20: llama.vocab_size u32 = 131072
llama_model_loader: - kv 21: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = tekken
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 32: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-07-23T21:18:56.567+01:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 1000
llm_load_vocab: token to piece cache size = 0.8498 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 131072
llm_load_print_meta: n_merges = 269443
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 1024000
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 1024000
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 12.25 B
llm_load_print_meta: model size = 6.58 GiB (4.61 BPW)
llm_load_print_meta: general.name = Mistral Nemo Instruct 2407
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 1196 'Ä'
llm_load_print_meta: max token length = 150
llm_load_tensors: ggml ctx size = 0.34 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size = 6376.59 MiB, ( 6376.66 / 21845.34)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 360.00 MiB
llm_load_tensors: Metal buffer size = 6376.58 MiB
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Pro
ggml_metal_init: picking default device: Apple M2 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
llama_kv_cache_init: Metal KV buffer size = 5120.00 MiB
llama_new_context_with_model: KV self size = 5120.00 MiB, K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
llama_new_context_with_model: CPU output buffer size = 2.08 MiB
llama_new_context_with_model: Metal compute buffer size = 2148.00 MiB
llama_new_context_with_model: CPU compute buffer size = 74.01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 2
INFO [main] model loaded | tid="0x2053a8c00" timestamp=1721765940
time=2024-07-23T21:19:00.754+01:00 level=INFO source=server.go:622 msg="llama runner started in 4.44 seconds"
[GIN] 2024/07/23 - 21:19:54 | 200 | 58.278754542s | ::1 | POST "/api/chat"
[GIN] 2024/07/23 - 21:20:46 | 200 | 51.786484292s | ::1 | POST "/api/chat"
[GIN] 2024/07/23 - 21:21:36 | 200 | 50.410828459s | ::1 | POST "/api/chat"
[GIN] 2024/07/23 - 21:22:10 | 200 | 16.625µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/07/23 - 21:22:10 | 200 | 103.75µs | 127.0.0.1 | GET "/api/ps"
[GIN] 2024/07/23 - 21:22:28 | 200 | 51.561848542s | ::1 | POST "/api/chat"
[GIN] 2024/07/23 - 21:23:18 | 200 | 49.819822666s | ::1 | POST "/api/chat"
[GIN] 2024/07/23 - 21:24:08 | 200 | 49.960540959s | ::1 | POST "/api/chat"
[GIN] 2024/07/23 - 21:25:01 | 200 | 53.47594725s | ::1 | POST "/api/chat"
[GIN] 2024/07/23 - 21:25:54 | 200 | 52.6346965s | ::1 | POST "/api/chat"
[GIN] 2024/07/23 - 21:26:47 | 200 | 53.664786458s | ::1 | POST "/api/chat"
[GIN] 2024/07/23 - 21:27:38 | 200 | 50.090380084s | ::1 | POST "/api/chat"
[GIN] 2024/07/23 - 21:27:40 | 500 | 2.598865541s | ::1 | POST "/api/chat"
Since my main use case involves a 6,000-token work-context file I created myself, I generally insert it as long context and then ask a question whose answer is usually a piece of information in the middle of the file.
To help re-create the scenario with public data, I have, for example, taken the public-domain text of Pride and Prejudice and cut it down to roughly 39K tokens. (Apologies that GitHub markdown doesn't like my use of three backticks to delimit the context passed to the model.)
39K token data and prompt
Below is a book of pride and prejudice between three backticks [39K tokens of book text elided] Mrs. Philips protested that they would have a nice comfortable noisy game of what?
I ran ollama run mistral-nemo, then pasted in the prompt. The correct output would be "lottery tickets", but instead it answers "whist".
Server results when I ran my prompt
[GIN] 2024/07/23 - 21:51:41 | 200 | 44.375µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/07/23 - 21:51:41 | 200 | 32.256709ms | 127.0.0.1 | POST "/api/show"
time=2024-07-23T21:51:41.724+01:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/si/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 gpu=0 parallel=4 available=22906503168 required="8.7 GiB"
time=2024-07-23T21:51:41.725+01:00 level=INFO source=memory.go:309 msg="offload to metal" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[21.3 GiB]" memory.required.full="8.7 GiB" memory.required.partial="8.7 GiB" memory.required.kv="1.2 GiB" memory.required.allocations="[8.7 GiB]" memory.weights.total="7.0 GiB" memory.weights.repeating="6.5 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="568.0 MiB" memory.graph.partial="568.0 MiB"
time=2024-07-23T21:51:41.726+01:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/var/folders/2b/z8p4dwhn3v5544mf9jn3gkrr0000gn/T/ollama3452655135/runners/metal/ollama_llama_server --model /Users/si/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 4 --port 57544"
time=2024-07-23T21:51:41.728+01:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-23T21:51:41.728+01:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
time=2024-07-23T21:51:41.728+01:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3440 commit="d94c6e0c" tid="0x2053a8c00" timestamp=1721767901
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x2053a8c00" timestamp=1721767901 total_threads=10
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="9" port="57544" tid="0x2053a8c00" timestamp=1721767901
llama_model_loader: loaded meta data with 35 key-value pairs and 363 tensors from /Users/si/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 (version GGUF V3 (latest))
[model loading output (llama_model_loader metadata dump, llm_load_vocab, llm_load_print_meta, llm_load_tensors, ggml_metal_init) identical to the earlier log]
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Metal KV buffer size = 1280.00 MiB
llama_new_context_with_model: KV self size = 1280.00 MiB, K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_new_context_with_model: CPU output buffer size = 2.08 MiB
llama_new_context_with_model: Metal compute buffer size = 564.00 MiB
llama_new_context_with_model: CPU compute buffer size = 26.01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 2
INFO [main] model loaded | tid="0x2053a8c00" timestamp=1721767904
time=2024-07-23T21:51:44.242+01:00 level=INFO source=server.go:622 msg="llama runner started in 2.51 seconds"
[GIN] 2024/07/23 - 21:51:44 | 200 | 2.553208042s | 127.0.0.1 | POST "/api/chat"
INFO [update_slots] input truncated | n_ctx=2048 n_erase=39171 n_keep=4 n_left=2044 n_shift=1022 tid="0x2053a8c00" timestamp=1721767935
[GIN] 2024/07/23 - 21:52:21 | 200 | 6.214528625s | 127.0.0.1 | POST "/api/chat"
I've experimented with moving the paragraph containing the needle around: placed at the end of the context, near the prompt, it succeeds (implying the model only remembers the last section of the input).
Moving the paragraph to the start of the context, away from the prompt (at the bottom), the model completely loses sight of it and fails (with "The passage does not specify what game Mrs. Philips suggests playing after the whist party breaks up, so it is impossible to determine with certainty what she proposed.").
Could it be defaulting to the wrong context size for the model, or something like that?
@rick-github commented on GitHub (Jul 23, 2024):
It's easier if you attach large files rather than pasting them into the post, but I see the relevant info in the second server log:
INFO [update_slots] input truncated | n_ctx=2048 n_erase=39171 n_keep=4 n_left=2044 n_shift=1022
This means that the per-request context window is 2048 tokens, far too small for the number of tokens you are feeding it. Although it's not in the log fragments, you also either have OLLAMA_NUM_PARALLEL=4 set, or it's unset and defaulting to 4 because you have a lot of memory. This is why you see --ctx-size 8192 and n_ctx = 8192: 4 times the actual per-completion context window of 2048.
You can increase the context window by adding "options": {"num_ctx": 65536} to the API call, or by running /set parameter num_ctx 65536 within the ollama run session. And if you don't intend to run 4 models in parallel, set OLLAMA_NUM_PARALLEL=1 in the server environment.
@MrSimonC commented on GitHub (Jul 23, 2024):
Amazing. I've tested and can 100% confirm you're correct.
Thanks so much for investigating.
@MeinDeutschkurs commented on GitHub (Sep 30, 2024):
I'm so confused. I use the ollama Python library with mistral-nemo, and I cannot increase num_ctx. Whatever I do, I only get a window of between 2,000 and 4,000 tokens:
Temperature seems to work. I try to feed in a bunch of text, which should then be combined and summarized down to a maximum of about 4,000 tokens. But the input is cut off - the summary picks up only content from the very end.
What can I try?
@rick-github commented on GitHub (Sep 30, 2024):
Server logs may aid in debugging.
@rick-github commented on GitHub (Sep 30, 2024):
Jamming the two folk stories together makes for a weird summary but it picked up elements from both stories. 16K tokens processed, no truncation.
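For anyone hitting the same wall from the Python client: options is also how num_ctx is passed through the ollama Python library, per request. A minimal sketch (the message content and the 32K budget are placeholders):

```python
import ollama

# num_ctx must be passed per request (or baked into a Modelfile);
# without it the server's default context is used and long input is truncated.
response = ollama.chat(
    model="mistral-nemo",
    messages=[{"role": "user", "content": "Summarize the following text: …"}],
    options={"num_ctx": 32768},
)
print(response["message"]["content"])
```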
@msummerfield commented on GitHub (Mar 13, 2025):
For anybody else coming here trying to figure out what is going on with the context length for mistral-nemo in Ollama, the explanation is that there was originally an error in the model's config.json file on Hugging Face, where max_position_embeddings was set to 1024000. This has since been fixed, but the fix has not propagated to the models available in Ollama. The correct value is 131072 (i.e. 2^17). I've implemented a 'special case' along the following lines to correct this in my code that fetches model parameters via the Ollama API:
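The snippet itself did not survive the mirror; what follows is a minimal sketch of the kind of special case msummerfield describes, not their actual code. The override table and helper names are illustrative, and reported_context_length is the /api/show helper sketched earlier in this thread:

```python
# Known-bad GGUF metadata: mistral-nemo conversions made before the
# Hugging Face config.json fix declare 1024000 instead of 131072 (2**17).
CONTEXT_LENGTH_OVERRIDES = {"mistral-nemo": 131072}

def effective_context_length(model: str):
    """Context length to trust for a model, working around bad metadata."""
    base_name = model.split(":")[0]  # strip any tag, e.g. "mistral-nemo:12b"
    if base_name in CONTEXT_LENGTH_OVERRIDES:
        return CONTEXT_LENGTH_OVERRIDES[base_name]
    return reported_context_length(model)  # fall back to /api/show metadata
```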
@igorschlum commented on GitHub (Mar 13, 2025):
Hi @jmorganca, can you look at this last post and, if confirmed, fix the llama.context_length = 1024000 parameter on this page: https://ollama.com/library/mistral-nemo/blobs/b559938ab7a0
Best!