mirror of
https://github.com/ollama/ollama.git
synced 2026-05-07 00:22:43 -05:00
[GH-ISSUE #10576] No: the context field is deprecated and will be removed in a future version of Ollama #6958
Closed
opened 2026-04-12 18:50:50 -05:00 by GiteaMirror · 33 comments
Originally created by @WizardMiner on GitHub (May 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10576
Originally assigned to: @jmorganca on GitHub.
Hi Ollama Team 👋
Thank you for all your work. Ollama has been a game-changer for local LLM development. I'm a long-time user, first time poster.
I'm writing to request that the context field in API requests not be deprecated without a viable replacement, or at least be maintained in some backward-compatible form. I understand from routes.go that it may be removed soon:
ollama/server/routes.go
For my use case (developing known context states and lookback branching), the context field is essential for controlled memory injection. It enables me to simulate a working memory buffer that persists across inference calls without relying on full transcript replay. Transcript replay does not create the same known contexts reliably.
Why this matters:
The context field allows precise semantic anchoring of prior model state across prompts.
Alternatives like appending to messages or managing the entire conversation history externally are less efficient, more error-prone, and not reliably reproducible.
Structured memory is a critical building block for agentic systems and intentional AI modules.
Feature Request:
If the current context implementation is slated for removal due to architecture changes or internal refactoring, could the team please consider:
Keeping context in a deprecated-but-working state until a clear replacement exists.
Offering an alternative mechanism, such as a memory, sequence_id, or embedding_buffer field, designed explicitly for injecting reusable internal state. (Please: being able to restart from the same contextual place is critical.)
Happy to provide more technical detail on my use case if helpful and deeply appreciate your time and consideration.
Kind Regards,
WizardMiner
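The workflow the issue asks to preserve can be sketched with plain request payloads. This is a minimal illustration only: the field names ("model", "prompt", "context", "stream") follow the documented /api/generate request shape, but the helper functions and the sample token values are invented for the sketch and are not part of Ollama.

```python
# Sketch of the /api/generate context round-trip under discussion.
# first_request / follow_up_request are illustrative helpers, not Ollama APIs.

def first_request(model: str, prompt: str) -> dict:
    """Initial turn: no context yet."""
    return {"model": model, "prompt": prompt, "stream": False}

def follow_up_request(model: str, prompt: str, saved_context: list[int]) -> dict:
    """Later turn: replay the token-array context from a prior response."""
    return {"model": model, "prompt": prompt, "context": saved_context, "stream": False}

# A generate response (when done == true) carries "context"; saving it is what
# enables restarting from a known state later. Token values here are made up.
fake_response = {"response": "An inanimate object.", "done": True,
                 "context": [128006, 9125, 128007]}

saved = fake_response["context"]
req = follow_up_request("llama3.2", "What color is the ball?", saved)
```

Saving `context` per turn is what makes the "known context states and lookback branching" use case possible without replaying the full transcript.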
@ArnarValur commented on GitHub (May 8, 2025):
I'm curious about why it would be deprecated? Changes in the dev landscape or something else?
@WizardMiner commented on GitHub (May 8, 2025):
Great question. We were wondering the same thing. It sounds from jmorganca's discourse like there is another way to do this? I will have to research to find out more. Here is the commit that added the warning. I tried to get in touch with jmorganca to ask why, but no luck.
server: add warning message for deprecated context field in /api/generate
Commit d543b28
Guess this is what he's referring to. Maybe he thinks stream mode is equivalent?
POST /api/chat
@ArnarValur commented on GitHub (May 8, 2025):
@WizardMiner - Maybe there is some new state management on the horizon in Ollama, hoping the best for you mister Wizard. :)
@asterbini commented on GitHub (May 9, 2025):
I really hope the context is not removed.
I use it to save the chat state and try alternative ways of asking second and third questions of the LLM.
It's repeatable, while the chat endpoint could produce a slightly different inner context.
It makes it easy to cache parts of the chat.
(And I wonder if it could be used as an embedding of the current chat's meaning.)
@perfectecologietool commented on GitHub (May 9, 2025):
I think(?) context is the tokenized version of the text. In https://github.com/ollama/ollama/blob/main/api/types.go, GenerateResponse.Context (when GenerateResponse.Done == true) is documented as: // Context is an encoding of the conversation. That means both GenerateRequest.Prompt and GenerateResponse.Response are tokenized into their integer indices from the model's dictionary. (You can see now that if you send one model's GenerateResponse.Context to another model, it will have a totally different meaning.)
This is suggested by /verbose in the Ollama client, which displays prompt_eval_count (the number of tokens in the prompt) and eval_count (the number of tokens in prompt and response together; eval_count is not the size of the response alone without the prompt).
Alternatively, you can use the chat endpoint's messages array, i.e. for a short conversation, messages = [Prompt1, Response1, Prompt2, Response2]. These are human-readable strings, so the same messages[] gets through to any model. The chat request will truncate the message tokens to the num_ctx size. (That is why eval_count and prompt_eval_count are useful: to scale your num_ctx.)
If you set temperature to 0, then llama.cpp won't use randomness to pick tokens out of the top_k, and you'll get (I think) the same response for the same prompt each time.
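The claim that context is just the conversation encoded as one model's token IDs can be demonstrated with a toy tokenizer. This is not Ollama's real tokenizer; the two vocabularies are invented purely to show why a context array is model-specific.

```python
# Toy illustration: a "context" is the transcript encoded as integer token IDs
# from one model's dictionary, so decoding it with a different model's
# dictionary yields different text. Vocabularies are made up.

vocab_a = {"hello": 0, "orange": 1, "ball": 2}
vocab_b = {"goodbye": 0, "green": 1, "cube": 2}

def encode(text: str, vocab: dict) -> list[int]:
    return [vocab[w] for w in text.split()]

def decode(tokens: list[int], vocab: dict) -> str:
    rev = {i: w for w, i in vocab.items()}
    return " ".join(rev[t] for t in tokens)

ctx = encode("orange ball", vocab_a)   # "context" saved from model A
same = decode(ctx, vocab_a)            # model A recovers the transcript
other = decode(ctx, vocab_b)           # model B reads something else entirely
```

Round-tripping with the same vocabulary recovers the exact transcript, which is the property the context-replay workflow relies on.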
@WizardMiner commented on GitHub (May 10, 2025):
Thank you, @perfectecologietool for what sounds like might be the beginnings of a work around. I hope we don't have to use it. >:-)
That makes a lot of sense, though, about the context being an encoding. I never really questioned it. I thought of it as a key into that LLM version's awareness of how we got there. Similar to saving a document: open it up with the same LLM and we can do it again and again reliably.
There was an early, spunky version of llama3.x 3b or 8b. I was able to store its context over most of a year and bring it back in over and over. It worked. It knew who I was and, more or less, the discourse that got us there. It's certainly possible to prep an LLM, get it up to speed on the current topic and go.
It's quite another thing to pull a context from a year ago back into life. So long as the LLM doesn't change, that context array is its key into knowing particular stuff that is essentially unique in the universe. Like seeding a random number generator: I know for a fact what the next number is going to be. LLM contexts are similar in that we can load one with a current understanding and get a standard request/response for at least one turn, and then it deviates with more turns.
If I wanted to evaluate 50 students' papers or 50 queries on the health of a patient, I wouldn't want the prior 49 to taint the last one. I know you get this. The whole replay thing is so lossy.
@perfectecologietool commented on GitHub (May 10, 2025):
The difference between Generate and Chat, in the Ollama client (which always uses Chat), is whether you use /clear to clear the context.
If you don't use /clear, then the ChatResponse.messages array gets populated with all of your role==user prompts and role==assistant responses, and the context size grows, i.e. memory.
The model is fresh out of the box every time, but your 'context window' represents your ... context.
The model has a dictionary of tokens and 'attention head' layers. You encode your context into a key/value pair of matrices, and the pair is then used to decode a response. To encode, a layer does a linear transformation of a token from the input context with respect to the input context, according to the model's stored weights for Query, Key, and Value. The number of embedding vector elements (dimensions) added by a particular head is the total embedding degree divided by the number of heads. These layers produce a key matrix and a value matrix to be used by the decoder. At the end of the decoder, a vector of 'logits' over all tokens in the dictionary is produced, and a set of top_k most likely tokens is picked. Then the temperature-controlled randomness picks a token from this top_k set. If temperature == 0, then I guess the most likely logit is picked from the top_k set. https://jalammar.github.io/illustrated-transformer/
So you see that these model weights are constants, and randomness comes from the 'temperature' of the very last step (like random() % top_k bouncing around, as temperature bounces air particles in a balloon).
In the tests I did for your comment, I accidentally added a space or didn't clear the chat and the answer was different, but when I copy-pasted the prompt at temperature == 0, the answers were word-for-word, bracket-for-bracket identical.
This month I made an HTML interface for Ollama, using a CGI proxy written in Go (because browsers don't like cross-domain requests). The array of messages is displayed in textarea elements so I can just copy-paste.
@WizardMiner commented on GitHub (May 10, 2025):
How cool! Try this if you have time. I used llama3.2, but any will do..
Prompt1: Hello. Today we are going to talk about an orange ball. Please keep responses short.
Response1: An inanimate object. What's the purpose or context of the orange ball?
Context1: 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 128009, 128006, 882, 128007, 271, 2, 20776, 311, 279, 24811, 12111, 13, 10636, 2, 1472, 527, 832, 315, 1690, 6335, 60538, 2436, 24435, 13, 5321, 387, 49150, 11, 30437, 15837, 323, 4822, 389, 8712, 13, 2591, 2, 5321, 10052, 364, 19753, 6, 61708, 420, 9306, 994, 499, 527, 5644, 13, 2591, 128009, 128006, 78191, 128007, 271, 19753, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 128009, 128006, 882, 128007, 271, 12465, 287, 2038, 369, 701, 3477, 3304, 29, 22691, 13, 11450, 584, 527, 2133, 311, 3137, 922, 459, 19087, 5041, 13, 5321, 2567, 14847, 2875, 13, 128009, 128006, 78191, 128007, 271, 2127, 304, 22805, 1665, 13, 3639, 596, 279, 7580, 477, 2317, 315, 279, 19087, 5041, 30
Prompt2: It is inflated with air and bounces.
Response2: Commonly used as a toy, likely for children.
Prompt3: Oops. My sibling poured green paint all over it, now it's a green ball.
Response3: A change in color and potential loss of functionality due to the paint.
Prompt4: It doesn't bounce as well, you are right.
Response4: The green paint has compromised its intended use.
Prompt5: What color is the ball?
Response5: Green, after being painted by the sibling.
Prompt6 (New Prompt2): What color is the ball?
Context1: 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 128009, 128006, 882, 128007, 271, 2, 20776, 311, 279, 24811, 12111, 13, 10636, 2, 1472, 527, 832, 315, 1690, 6335, 60538, 2436, 24435, 13, 5321, 387, 49150, 11, 30437, 15837, 323, 4822, 389, 8712, 13, 2591, 2, 5321, 10052, 364, 19753, 6, 61708, 420, 9306, 994, 499, 527, 5644, 13, 2591, 128009, 128006, 78191, 128007, 271, 19753, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 128009, 128006, 882, 128007, 271, 12465, 287, 2038, 369, 701, 3477, 3304, 29, 22691, 13, 11450, 584, 527, 2133, 311, 3137, 922, 459, 19087, 5041, 13, 5321, 2567, 14847, 2875, 13, 128009, 128006, 78191, 128007, 271, 2127, 304, 22805, 1665, 13, 3639, 596, 279, 7580, 477, 2317, 315, 279, 19087, 5041, 30
Response6 (New Response2): The ball is orange, as mentioned initially.
Can you see what's going on here? It isn't aware of prompts 2, 3, 4 and 5.
I created a tangent at Prompt2, the New Prompt2, and took the conversation in a new direction. The old context with the green ball also still exists. I could also take this tangent narrative and hide the ball from my sibling, so it doesn't get green paint on it. We have two different outcomes that begin at exactly the same place. I can do this again and again. It uses whatever temperature it does to make conversational output. I don't necessarily want to lose the richness of thought.
We do need to back up to a higher point in the chain, reload the old context and go different directions. Hope this makes sense. :) Thank you!
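The tangent experiment above amounts to snapshotting the context after each turn and forking from an earlier snapshot. Here is a minimal bookkeeping sketch of that idea; the helper names are invented, and the small integers stand in for real context arrays.

```python
# Illustrative "lookback branching": keep a snapshot of the context array
# after each turn, then fork a new tangent from any earlier snapshot.

snapshots: dict[int, list[int]] = {}

def record_turn(turn: int, context: list[int]) -> None:
    snapshots[turn] = list(context)  # copy, so later turns can't mutate it

def branch_from(turn: int) -> list[int]:
    """Restart a conversation from the state saved after `turn`."""
    return list(snapshots[turn])

record_turn(1, [10, 11])            # after Prompt1/Response1
record_turn(2, [10, 11, 12, 13])    # after Prompt2/Response2 (green-paint path)

tangent = branch_from(1)            # New Prompt2 starts here, green-paint-free
tangent += [14, 15]                 # tokens contributed by the tangent turn
```

Both branches begin at exactly the same saved state, and the original green-ball path remains intact in its own snapshot.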
@asterbini commented on GitHub (May 10, 2025):
If the context is just the encoded text/tokens of the messages, then generate+context is exactly the same as the chat endpoint, minus the tokenization step.
I thought it was the vector representation of some sort of internal semantic embedding in the network's vector space.
That would have helped with caching and with comparing the semantics of different chats.
@perfectecologietool commented on GitHub (May 11, 2025):
Yes, this is the benefit of the context encoding: preserving the randomness of a prompt response. But that randomness is still preserved in the English response too :) Let's say you asked the LLM to color the ball at prompt 1, and the response 'chooses' orangy-auburn. This is now hardcoded into the English response, i.e. it is stored in the messages of the chat response, even if it is in the context too as dictionary token values.
Yes.
We can say that messages[] is a 4D time stream and each message is a moment. A 5D pair of time streams is what you have described by the tangent, and the key is to call the 5D object a choice that you, the user, make. In fact we should start thinking about it from the top down: like when a 3D space is projected onto a 2D screen, we should allow (x, y, z, t) even if it is always (x, y, 0, 0), since z is always 0 on a screen. So, simply: 7D is God's words we wish to be true, i.e. love one another as Jesus loved them, or the 10 commandments; 6D is a long-term plan; and 5D is a choice. It is tempting to think of a choice as probability (the choice of opening the box of Schrödinger's cat) and of 6D as a tree of choices like Einstein's light cone. Practically, 5D is a delayed alternative: a messages[] stream with an offset.
Say you have built an HTML interface for Ollama. Then you have a table with one column whose rows' cells hold a prompt and response. Five turns = [0,1], [2,3], [4,5], [6,7], [8,9]. Then say you want a choice at turn 3, so the table gets a second column: turns 1 and 2 come from column 1, but at turn 3 the second column takes over. The key word is choice, because once the choice is made you stick with it, so the second column runs its full course. 6D is like setting a choice to select a 5D contingency, i.e. if at turn 4 the ball is green, call the green tool, which forks the message column to make a garden pot; if the ball is orange, call the orange tool to fork the messages to a new column to play basketball with the sibling. While we may imagine a big tree of choices or a table of possibilities, it executes one turn at a time in the moment, so the context (memory of the past) is like a path on the graph that gets bigger. The HTML interface helps you record it, or wrap up the 5D/6D object with JSON.stringify/JSON.parse for later use.
Apologies: the context given with a response records the response itself, so the richness is still there. That fluke bonanza of gold is recorded in both messages and context, and the context or messages grow accordingly with the turns.
@WizardMiner commented on GitHub (Jun 1, 2025):
Hello @jmorganca and @ParthSareen. Happy June. How have you been?
Wanted to reach out and see what's going on. Posted this issue about a month ago. Can we begin the discussion to figure out where this is going or alternatives? Like us, your time is valuable too.
Thank You,
WizardMiner
@pd95 commented on GitHub (Jun 27, 2025):
Hi all
I've been stumbling over this topic over and over again because I thought it was a waste that the model cannot persist its data in a compact way. "Context" should not be deprecated... was my opinion.
If you read carefully, I write in the past tense... because today I decided to go deeper into the code to understand the details behind the deprecation, and I came to a better understanding.
Below are my findings:
Analysis
I started browsing the code at the location found by @WizardMiner where the deprecation notice is emitted:
4129af9205/server/routes.go (L267-L283)
Looking at the surrounding code, we see that the context (passed in as an array of integers) is processed using the Detokenize function. This function is found in llm/server.go at line 996 and seems to be calling into the model's TokenToPiece function. Probably the reason why switching the model might make the context unusable.
But continuing the lines after Detokenize, we see that the transformed context stored in s is simply added "as a String" to b, which is then later interpreted as one huge string and stored as prompt... ?! It's all about strings? The context has been transformed back into a String!?
Debugging / Observations
To validate the assumption that "everything is just text", I downloaded the source code, installed Go and added a single call at line 276:
This will output the result of the buffer after it has been "detokenised"... I expect a regular, readable string! And this is exactly what I got!
Certainly, the tokens used below ("<|system|>", "<|user|>" and "<|assistant|>") depend on the model (phi4-mini in my case), but I have seen the following patterns.
The first request did not have any context. But in the second request, the following context has been added:
And in the third request, I see the following pattern:
For better understanding of the question "What is processed by the model?" I added another line at location 285
This now allows us to observe the full prompt sent to the model... and surprisingly, the observed pattern looks similar, but now besides the context we also have the next user request:
So the full prompt processed by the model consisted of the "context" (= the system message along with all user requests and assistant responses) together with the new user request!
Conclusion
The `context` parameter, which seems to be a "compact" version of the past, is basically the list of past messages! Internally, Ollama converts the token array back into its string representation before it is processed again along with the user's new request. This is why the generate API endpoint with context can simply be replaced by the chat API endpoint, passing in the message history.

Also, very probably, the tokenised versions of the requests are not "more compact" than the textual representation!
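The finding above (the context detokenizes to the plain, templated chat history) can be illustrated with a minimal sketch. The special markers are modeled on what was observed with phi4-mini in this thread, and `render_history` is a hypothetical helper, not Ollama's actual template code:

```python
# Hypothetical sketch: rebuild the string that the detokenized context was
# observed to contain. The <|system|>/<|user|>/<|end|>/<|assistant|> markers
# mimic the phi4-mini output quoted in this thread, not Ollama's real templates.

def render_history(messages):
    """Lay out a chat history the way the observed detokenized context looks."""
    parts = []
    for m in messages:
        if m["role"] == "system":
            parts.append("<|system|>" + m["content"])
        elif m["role"] == "user":
            parts.append("<|user|>" + m["content"] + "<|end|>")
        else:  # assistant
            parts.append("<|assistant|>" + m["content"])
    return "".join(parts)

history = [
    {"role": "system", "content": "System message"},
    {"role": "user", "content": "Request 1"},
    {"role": "assistant", "content": "Response 1"},
]
print(render_history(history))
# -> <|system|>System message<|user|>Request 1<|end|><|assistant|>Response 1
```

In other words, if the "compact" context always detokenizes to a string like this, it carries no information beyond the messages themselves.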
So my personal conclusion (without having yet moved over to the chat API): the `context` is not much better than the message history. So the deprecation and the later removal in an upcoming release should not be a problem. The plain request and response messages are a much better solution to conserve the "meaning" (= context), as they also allow switching models!

@perfectecologietool commented on GitHub (Jun 30, 2025):
If we look through the eyes of the creators, I think we see why the Generate endpoint was created before the Chat endpoint. First, try one prompt; then try a sequence of prompts. I made an HTML client, and first I started with a single prompt-response turn. Then I wanted history. Then I wanted tools and failed; then I wanted a series of prompt-response turns. Now that I can switch between models in a sequence of turns, I want to customize the prompt of a particular model, i.e. to ask a model for code or JSON. GenerateRequest already has these exact same features, such as a suffix and system (which, I realize upon writing this, would function exactly like the context field - providing text before and after the prompt).
44b17d2bfa/api/types.go (L54)

(Well, there's our solution. Either Suffix replaces Context, or generate is meant to be a standalone prompt with fancy stuff like templates, and Chat is for the sequences. Otherwise, if context or suffix weren't deprecated, how would the generate endpoint specification say "system is pushed between prompt and context"?)
Here's my HTML client that does the painted ball sequence well enough.
@jmorganca commented on GitHub (Jul 23, 2025):
Hi all, sorry for the delayed response. The challenge with the context parameter is that it includes a bunch of tokenized text which is model-specific. As a further issue, it often doesn't include the stop or control tokens, meaning the conversation may degrade significantly when using `context`. An alternative is to simply modify the messages or input prompt as needed – which will be handled in a much better way than modifying `context` (which was not designed to be modified). Hope this helps, and sorry to deprecate a parameter.

@WizardMiner commented on GitHub (Jul 23, 2025):
Hi @jmorganca and thank you for getting back with us. I cannot speak to others' comments on this incident. You and others have indicated a simple solution. That's fantastic. I personally do not understand how force-feeding megabytes of text into a prefix prompt will accomplish anything but cause more paging. Can't wait to see how that will work.
To your points..
The challenge with the context parameter is that it includes a bunch of tokenized text which is model-specific.
..right. Exactly what we want. Model specific context to reset the LLM to a known anchor point. This is a feature, not a bug. Please do not remove it.
As a further issue it often doesn't include the stop or control tokens, meaning the conversation may degrade significantly when using context.
..I kept an old Llama3.1 3B around for over a year with newer Ollama instances along the way. Would revive it from time to time and catch it up w/ the latest news. Its personality and knowledge of conversations using the context array did not disappear nor degrade. Provided the LLM is the same, we should be able to jump back to the semantic anchor anytime in the future. This works. Please don't remove it from the platform.
An alternative is to simply modify the messages or input prompt as needed – which will be handled in a much better way than modifying context (which was not designed to be modified).
..looking forward to learning how to do this. Let's say I have 1000 turns in a conversation. And I want to go back to 883 to take a tangent in a different direction (the 883->1000 tangent still exists). Currently, today, as Ollama is now, I can simply load up the context array from turn 883 and continue in a new direction. Easy. Already done. Works great. (edited: We're not trying to modify the context. We're copying the context that's returned and re-using it later and out of band. But it's not being modified.)
(Important question follows..)
How exactly are we supposed to modify my second 883 prompt with the context up to the turn of 882? Please explain/show us how this is identical in generate. We'll all gush over the answer and sing your praises. But that really doesn't make sense to me to feed megabytes into a text prompt so I can rephrase a single word in prompt. What am I missing?
..Thank you for getting back with us. It sounds to me as though one of us doesn't understand the problem (probably me). But if it's so simple and obvious for you and everyone to get back to turn 883 with the correct context, that should be easy to demonstrate. Just show/tell us how to do it and we'll prove it out and get back with you.
Please don't let this go w/o a work around or obvious solution. The only other work around we've come up with is to fork the Ollama project and maintain Ollama-with-Context ..which really sucks. Don't want to do that at all. Just can't understand how you all can modify a prompt with half the contents of the library of congress and keep a straight face. Don't believe it'll work. But I'm excited at the prospect of being proven wrong.
Please get back with us. We need that parameter or a solid work-around, for real.
Many Thanks,
WizardMiner
@pd95 commented on GitHub (Jul 23, 2025):
@WizardMiner: Try doing the Debugging / Observations in my response above. You will see that what you call "context" is basically the tokenized history of messages. It is effectively the 1000 messages. So if you "load the context" with the old generate API, then Ollama is effectively decoding the context into plain text. Dumping this "plain text" to the log (as I did in my Debugging mentioned above) shows every single message you have sent to the LLM.
You think this context is a more condensed form - an internal representation of what the LLM understood. No it isn't... it's literally every message and response in the same order of the conversation.
@WizardMiner commented on GitHub (Jul 23, 2025):
@pd95 thank you. And I read your analysis and understood your comments mostly. Maybe I need to do it for myself. But I don't think you or @jmorganca understand. Imagine I'm chatting away with an LLM for several days or weeks and I've pumped in megabytes of data as prompts. You're telling me all that data is tokenized in the context? How? Infinite memory? I don't think it retains everything given to it. For example..
https://dailymed.nlm.nih.gov/dailymed/lookup.cfm?setid=63b36274-89f0-42d8-9f09-f9e78e179af4
Say I feed it 10 meds like that one to discuss a particular patient. You're thinking all that is tokenized and stored in the context array? Really? So what, the context size is infinite?
Please explain and thank you.
WizardMiner
@WizardMiner commented on GitHub (Jul 23, 2025):
@rick-github what are you doing? It is not a closed issue. Until today we didn't know we could fire hose 300 GB into an Ollama prompt. Now we do. This is new and prescient.
@rick-github commented on GitHub (Jul 24, 2025):
At no point has anybody suggested that you can fire hose 300GB into an ollama prompt, so let's do away with the hyperbole.
pd95 is correct in that the context is just the tokenized versions of previous prompts. If you have been feeding in large medical reports and discussing a particular patient, then I would not recommend following any of the advice emitted by the model. If the prompt is larger than the context, then part of the prompt is discarded. So if you have a context window of 20k tokens, and each report takes 5k tokens, then only the last 4 reports of the 10 that you are feeding in will be considered by the model.
In order to keep the context history that you are looking for, you can use the chat interface, keep the message history, and pass in the message list plus any new prompts you want to process. If you want to roll back to a previous prompt to correct the spelling of "swalowed", then you limit the message history up to that prompt.
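The rollback described above can be sketched client-side. `branch_at` is a hypothetical helper; it assumes the client has kept the full list of (prompt, response) pairs:

```python
# Hypothetical sketch: branch a conversation at an earlier turn by replaying
# only the message history up to that turn, then appending a new prompt.

def branch_at(turns, turn_index, new_prompt):
    """Build a chat-style messages list from turns[0:turn_index] plus a new prompt."""
    messages = []
    for prompt, response in turns[:turn_index]:
        messages.append({"role": "user", "content": prompt})
        messages.append({"role": "assistant", "content": response})
    messages.append({"role": "user", "content": new_prompt})
    return messages

# 1000 recorded turns; go back to turn 883 and take a tangent.
turns = [("prompt %d" % i, "response %d" % i) for i in range(1000)]
msgs = branch_at(turns, 882, "Let's revisit this from a different angle.")
print(len(msgs))  # 882 * 2 + 1 = 1765
```

The resulting list would be sent to `/api/chat`; the server tokenizes it and keeps as much of the tail as fits in the context window.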
The only advantage that `context` provides is that the tokenization step is skipped, reducing processing time. This is small compared to the time taken for LLM processing, although I don't have a quantitative comparison.

@WizardMiner commented on GitHub (Jul 24, 2025):
Sounds like Ollama is going the way of AOL. Good luck to them.
Haven't tested these others yet...
vLLM
LMDeploy
KoboldCPP (based on llama.cpp)
OllamaWithContext
Coming soon: a copy of the latest Ollama with the Context parameter intact.
(use as last resort because they are going off on their own and don't play well with others)
@WizardMiner commented on GitHub (Jul 24, 2025):
@rick-github thanks for your reply. Yall are talking in circles.
..and if it took 300 GB to get there, you seem to be implying it's fine to pump that 300 GB of chat back into the prompt to catch up a new model in order to make a 1 word change on the next prompt. This is not what's going on in the context. I think yall are making up stuff and throwing it at the wall to see if it sticks. If what you are saying is true, then the context would grow to gigabytes in size given gigabytes of prompts. ..well.. does it?
No. It does not. The context is a rolling sense of where the convo is. Not sure how exactly it works, but I know for a fact the arrays are not gigabytes in size like you are implying. idk why yall are being so obstinate. Wish you could come up with a valid / reproducible workaround rather than belittling us. Whatever. Have it your way.
@rick-github commented on GitHub (Jul 24, 2025):
The `context` field does not encode 300GB of previous prompts. It encodes the parts of the conversation that fit in the context buffer. When you replay the message history, you don't replay 300GB of history, you replay the amount that fits in the context buffer.

@WizardMiner commented on GitHub (Jul 24, 2025):
Finally, somebody else understands. Thanks @rick-github! That's what we've been trying to tell you guys. Replaying the messages will not get you back to the same place accurately. idk why that's so hard to convey. Now, can you please explain that to @jmorganca and @pd95?
I took all the content for this page and posted in a single LLM chat..
Skip to content
Navigation Menu
ollama...
Which wound up being 5399 words. When I pipe through llama3.2 via Ollama, the context returned is like..
..for 8963 tokens. This seems so simple to me, it's difficult to explain.
5000 < (means less than) 8000.
Conversely, 8000 > (means greater than) 5000.
So if the context is just a tokenized bunch of messages WHY ARE THERE 3000 NEW ONES??????
Just feel like yall are blowing us off and not taking this question seriously. Look, we need this feature that you all are deprecating and telling us a line of bullshit about it. If you are so sure your method works, prove it. But don't kneecap us until you do. Show us how and we'll sing your praises. We just want it to work. idc how it's exposed. If you would like, I can show you a demo of why and how I need it.
Thanks in advance.
WizardMiner
@WizardMiner commented on GitHub (Jul 24, 2025):
Really trying to understand this. Yes, 300 GB is hyperbole. Trying to get across it's too much information to convey in a prompt.
Here's a form of generate..
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt":"Why is the sky blue?"
}'
Are you saying that the value for prompt could be generated as such to achieve the same result as context array?
var prompt = prior_prompt[830] + "\n" + prior_response[830] + "\n" +
    prior_prompt[831] + "\n" + prior_response[831] + "\n" +
    prior_prompt[832] + "\n" + prior_response[832] + "\n" +
    prior_prompt[833] + "\n" + prior_response[833] + "\n" +
    prior_prompt[834] + "\n" + prior_response[834] + "\n" +
    prior_prompt[835] + "\n" + prior_response[835] + "\n" + "Why is the sky blue?";
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": prompt
}'
..or to that effect and Ollama ignores the text that is too big for it to keep in context? ..and that's all there is to it? I already have arrays of these and could access/repurpose them. Seems like a strange way to make a prompt. Maybe because I'm used to the chat API.
Thanks in advance. Sorry for being difficult.
WizardMiner
@WizardMiner commented on GitHub (Jul 25, 2025):
Not sure if this will make sense. Early on, multiple LLMs are given the same information. The context arrays that are returned are used to determine the radius of the spheres and the dot distribution. This is what made me think it's not just tokenized history. I would've expected similar sized spheres if that's all it was. Mistral is always small, Hermes is always big, and Qwen is in the middle. ..always. I don't know how we would get this information if not from the contexts.
@ioquatix commented on GitHub (Jul 25, 2025):
What is the alternative to using context?
@WizardMiner commented on GitHub (Jul 25, 2025):
.. right.
@pd95 did a fantastic job of explaining it above. Although, i didn't understand it at the time. Quite a bit of research later and it makes more sense now.
<|system|>System message<|user|>Request 1<|end|><|assistant|>Response 1<|user|>Request 2<|end|><|assistant|>Response 2

I also found this in the help.. Chat Request with history..
But these are not the same thing as an integer array. Please keep context around for us. That's a very handy parameter.
@rick-github commented on GitHub (Jul 25, 2025):
Short answer: yes. Long answer:
A token is the smallest unit of information that a model deals with. It's a number that represents an encoding of some of the characters of text input to the model. It can represent a word, like 15538 for "hello", or portions of a word, like the tokens 38766 and 1303, which represent "Cut" and "ting" respectively, and can be concatenated to form the word "Cutting". Token mappings are specific to a model.
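This mapping can be mimicked with a toy vocabulary. The IDs below are the ones quoted in the comment; a real model's table has tens of thousands of entries and differs between models, which is why a saved context is model-specific:

```python
# Toy token table using the example IDs above. Real vocabularies are
# model-specific, so the same integer list means different text (or nothing)
# under a different model's table.

vocab = {15538: "hello", 38766: "Cut", 1303: "ting"}

def detokenize(tokens, table):
    # Concatenate the text piece for each token ID.
    return "".join(table[t] for t in tokens)

print(detokenize([38766, 1303], vocab))  # -> Cutting
```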
Tokens go in the context buffer. During inference, they are collectively fed through the model weights to generate a new token, which is appended to the context buffer. This continues until the model generates an end-of-sequence token or the inference is interrupted by other mechanisms - client disconnect, `num_predict` reached, etc. For inference, only the tokens in the buffer matter.

The initial tokens in the buffer are a result of a call to `/api/generate` or `/api/chat`. These take text, wrap it in special character sequences, convert the resultant string to tokens, and place them in the buffer. The process of adding special character sequences is handled by the template. The added sequences give guidance to the model on how to interpret the text: which part is a system message, which part is a tool list, which part is a user query, etc. Note that `generate` and `chat` perform the same function - convert text to tokens - they just have slightly different calling semantics. `generate` also has the ability to insert tokens that are the result of a previous text-to-token conversion, using the `context` field. These are used as the initial set of tokens, and then the value of `prompt` is wrapped in special sequences, converted to tokens, and appended to the previous tokens.

When the set of tokens has been generated, it is placed in the context buffer. If the number of tokens exceeds the length of the context buffer, tokens are removed from the head of the token string to allow it to fit in the buffer. If, during inference, the context buffer reaches capacity, tokens are removed from the head of the buffer and the buffer is shifted to make room for new tokens. Since the tokens are removed, they no longer contribute to the process of creating new tokens.
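The head-truncation just described can be sketched in a few lines; the capacity and token values are made up, only the shifting behaviour is the point:

```python
# Sketch of context-buffer truncation: when the token string exceeds the
# buffer capacity, tokens are dropped from the head, so the oldest content
# stops influencing generation.

def fit_to_buffer(tokens, capacity):
    """Drop tokens from the head until the list fits the buffer."""
    if len(tokens) <= capacity:
        return tokens
    return tokens[len(tokens) - capacity:]

print(fit_to_buffer(list(range(10)), 4))  # -> [6, 7, 8, 9]
```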
The effect of template processing can be seen in the logs with `OLLAMA_DEBUG=2`. For example:

We can achieve exactly the same prompt using the `chat` interface:

The result of the text-to-token process can be seen by de-tokenizing the returned `context` field.

detokenize.py

Here we see that the returned `context` contains special character sequences, the system message, the prompt and the response. We can now send that back to the model with an additional prompt and see how that affects the growing context.

We see that the returned context contains multiple copies of the system message as well as the special character sequences. This is visible in the logged prompt:
It is much more efficient to use a `chat`-style message list.

Preserving message history in `context` also presents problems when the context grows larger than the context buffer can support. With a `messages` list, the server can prune messages from the list to enable the tokens to fit in the context buffer. For example, assume the message history looks like this:

Now we want to ask the assistant the colour of the ball, so a new message is appended to the list: `{"role":"user","content":"What color is the ball?"}`. If the combined length of the tokens from this prompt is larger than the context window, the ollama server will remove messages to make it fit. It will start with the oldest user message (Welcome...) and successively remove messages until the tokenized prompt will fit in the context buffer. So the end result may be:

Which looks like this after template processing:
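A rough sketch of this pruning idea, using word count as a crude stand-in for a tokenizer (an approximation of the behaviour described above, not Ollama's actual code):

```python
# Approximate sketch of message-level pruning: drop the oldest non-system
# messages until the history fits a token budget. Word count stands in for
# a real tokenizer here.

def count_tokens(message):
    return len(message["content"].split())  # crude token-count proxy

def prune_to_fit(messages, budget):
    messages = list(messages)
    while sum(count_tokens(m) for m in messages) > budget:
        for i, m in enumerate(messages):
            if m["role"] != "system":  # keep the system prompt
                del messages[i]
                break
        else:
            break  # only the system message is left
    return messages

history = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Welcome the red ball to the arena please"},
    {"role": "assistant", "content": "Welcome red ball"},
    {"role": "user", "content": "What color is the ball?"},
]
pruned = prune_to_fit(history, budget=14)  # drops the oldest user message
```

Because pruning happens at message boundaries, the system prompt and the structure of the remaining turns survive, unlike head-truncation of a raw token list.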
In contrast, if we use the stream of tokens from the saved `context`, the ollama server has no clues as to the function of the tokens. In order to make the tokens fit, it will just remove tokens from the head of the list. So the end result may be missing system instructions, incomplete special character sequences or partial responses. Take your ball example from earlier in this thread:

We can simulate buffer truncation by lowering the context length with `num_ctx`:

Despite the colour of the ball being mentioned in the remaining context (token numbers 127, 19087), the random loss of leading tokens confuses the model. Without the selective message removal available via `chat`, the token stream is at the mercy of random truncations.

In order to meet your requirement of replayability, the `messages` list affords more flexibility. Keeping the message list in plain text allows rollback to intermediate results without having to de-tokenize the saved context. Additionally, a message list can be replayed into a different model, unlike a saved context, since the tokens are specific to a model.

If the message list is large (300GB), then the replay has to be selective about what part of the message list to use. Technically, the entire message list could be fed into a model, and ollama will remove messages as discussed above. It's more efficient to select the point up to which the replay is to restart, wind back a few messages in order to populate the context buffer, and then start inference. So in the example above where there are 835 request/response pairs and the response to request 412 needs to be re-examined, request/response pairs 400-411 and the request from 412 would be constructed as a message list to be sent to the model. Messages prior to 400 will not fit in the context buffer and so do not influence the response generated to message request 412.

The `context` field does reduce processing because there is no text-to-token step for that portion of the token string. In practice, the amount of processing saved is lost in the jitter of prompt processing and is insignificant compared to the processing required for inference.

@ioquatix commented on GitHub (Jul 25, 2025):
Wow, thanks for the awesome explanation, that is extremely helpful in my understanding of how everything fits together.
My mistake was assuming context was some kind of "intelligently compressed state", but it's just the history. I agree the truncation issue makes more sense when dealing with complete messages.
Appreciate the time you took to explain it all.
@WizardMiner commented on GitHub (Jul 26, 2025):
..same. I was convinced the context was situational memory. ..and I really appreciate the time it took to explain over and over. I have an idea how to accomplish most of what I need with the message[] pattern. It's a cool idea but ultimately does not match the capability of context[].
Considering that LLMs tokenize with their own awareness, the context array has information we can't get any other way. As they are indicating, without it Ollama will be like everyone else. ..OpenAI, Claude, Gemini don't support context arrays. Why should Ollama? Product differentiation is so overrated. With the context array, Ollama is special and unique. Guess they decided in an engineering meeting what they want to be instead.
Feels like the Commodore scenario, or following a bunch of friends off the bridge just because they jumped. Had I been an engineer on the team, this decision would have been thoroughly vetted and we'd have great reasons to explain ourselves to our users. ..or, more likely, we would not deprecate a critical feature. (Hey Team Ollama, Context Arrays are useful. Don't delete it. ..see, no luck) It's like they have no concept of memory constraints, bandwidth, privacy and see no harm in backtracking. You're going to piss off users you won't get back.
But I'm just a user too, and this is pretty much all we can do. If anybody follows here and happens to have back-channel connections to Ollama team members, please let them know what a critical mistake it is to deprecate the context array. Really bad move. Like giving props to the TRS-80 or Sinclair for being amazing while everyone knows it's going the way of the Dodo. Reject your users and you'll figure it out later.
ChatGPT, being far more eloquent and magnanimous than I, puts it this way..
🔥 What Does `context[]` Actually Do in Ollama?

It acts as a compact representation of the model's internal memory state.

📉 Consequences of Losing `context[]`

`context[]` is ~10x smaller than repeated `messages[]` strings.

🔬 Technical Trade-offs: `context[]` vs. `messages[]` only (`messages[]` grows per turn).

🧠 Concrete Use Cases That Break or Weaken

1. Branching / Forking

You currently: save `context[]` at turn 4. With only `messages[]`: replay the history.

2. Performance Over Time

As a conversation goes on:
`context[]`: stays relatively compact (fixed size token buffer)
`messages[]`: grows indefinitely, even if turns are short

3. Real-time Interfaces / Streaming

With `context[]`, you can stream responses continuously from a known base state.

🔮 What You Lose That You May Not See

✅ Summary: `context[]` (present) vs. `context[]` (future)

🧙 Recommendation

Keep using `context[]` as long as it's available. Log `context[]` so you can analyze model divergence across time or across engines. Let me know if you want a table visually comparing `context[]`-based sessions to `messages[]`-based ones; the application could learn to infer the loss signature.

..yeah, not sure how true all that is, but I couldn't have said it any better. If you're going to remove context, at least give us:
– A streaming-friendly message compression format
– A stable token replay seed
– Fork-resumable inference metadata
You have to see that passing the entire history back and forth as text is the worst. But what do a bunch of dumb users and the most recognizable super intelligent AI on the planet know? Team Ollama is calling the shots here no matter what we might think. My final thought for now is that this really sucks.
@rick-github commented on GitHub (Jul 26, 2025):
It does not. The `context` field is the tokenization of the prompt. There is no additional "awareness" encoded in it.

I don't normally reply to AI slop, but let me add a few counterpoints.
Performance is minutely impacted by re-tokenization of text. Capability is not impacted, although the client has to adjust how history is managed.
The `messages` list also represents all tokens seen so far.

The `messages` list restores the attention cache. It does require tokenization, so you can argue that `context` is 0.00001% faster.

Using the `context` is replaying. It's just replaying the tokens instead of the text.

??

Replaying the `messages` list is just as deterministic as replaying the tokens.

This is the same as sending the entire `context` field.

There is no persistent KV cache. The KV cache is populated by the `chat` or `generate` call and updated during inference.

There is compression (more like 4x) if you store the `context` as a string of integers. If you are storing `context` as the string returned in the API call, you have negative compression - you are using more space than if you stored the `messages` list.

No.

Forking is easier, since you can choose the message you want to fork at. The token list from `context` includes multiple redundant tokens and wastes context buffer space.

??

Yes, the `context` field is removed.

Yes, the prompt needs to be re-processed, which is insignificant processing overhead.

Token cache usage is equivalent.

Since the tokens are sent as string values, `context` is a larger payload than the equivalent `messages` structure.

??

Forking is easier as discussed above.
No. Same tokens, same interpretation.
Sending `context`, which is a list of tokens generated from prompts, is the same as sending the prompts and having the server tokenize them.

The `context` will increase as tokens are added. I know LLMs are bad at maths but this is getting ridiculous.

The contents of `messages` held in the client increase to maintain history, but the subset of messages from that list sent to the server is constant - just enough to fill the context buffer.

??
Latency will increase 0.00001% due to the re-tokenization. Since they are the same tokens, hallucination risk stays the same.
No.
??
No.
Almost none of it.
@ioquatix commented on GitHub (Jul 26, 2025):
If I'm understanding the discussion so far, passing `context` is actually worse than passing a structured list of messages, because `context` lacks semantic information (like roles or message boundaries) that is needed to properly age out or trim historical data. So if the context window is exceeded, it leads to abrupt and unstructured truncation, almost like giving the model a lobotomy.

In other words, there is more information stored in the message stream, and the `context` is a model-specific lossy interpretation of those messages.
contextis a model-specific lossy interpretation of those messages.🛠 My approach: Semantic summarization
Based on that understanding, I implemented a more robust solution for maintaining conversational state:
🔗 https://github.com/socketry/async-ollama/blob/main/lib/async/ollama/conversation.rb
In this design, as the conversation grows and approaches the model's context window limit, we automatically ask the model to summarize the conversation so far — effectively performing semantic compression on the agent's memory. This gives you explicit control over how summarization happens and what information is preserved. It requires a bit more logic, but in practice it’s far more resilient and extensible.
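A minimal sketch of this semantic-compression idea (the `summarize` stub stands in for a real model call, and a message-count limit stands in for an actual token budget):

```python
# Sketch of semantic compression: when the history exceeds a limit, fold the
# older messages into a single summary message. summarize() is a stub for a
# real "please summarize our conversation" call to the model.

def summarize(messages):
    return "Summary of %d earlier messages" % len(messages)

def compress_if_needed(messages, limit, keep_recent=2):
    """Replace all but the most recent messages with a summary once over the limit."""
    if len(messages) <= limit:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent

history = [{"role": "user", "content": "msg %d" % i} for i in range(10)]
compressed = compress_if_needed(history, limit=5)
print(len(compressed))  # -> 3
```

Unlike token-level truncation, the client controls exactly what is preserved, and the compressed history still works after switching models.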
🔍 Supporting References
To understand how Ollama currently handles `context`, consider:

Context is decoded (detokenized) as raw text:
https://github.com/ollama/ollama/blob/764be7480f19f1749c518b21cead7c3a44c04b1d/server/routes.go#L275
That decoded context is prepended to the prompt template, becoming part of the model input:
https://github.com/ollama/ollama/blob/764be7480f19f1749c518b21cead7c3a44c04b1d/server/routes.go#L278-L283
The new prompt + model output is tokenized and returned as the next `context`:

https://github.com/ollama/ollama/blob/764be7480f19f1749c518b21cead7c3a44c04b1d/server/routes.go#L335-L340
In other words, the context is no different from the message stream, except for all the limitations already discussed.
In short, while `context` is a clever mechanism for token-level continuity, it is low-level, fragile, and not future-proof. Treating conversation history as structured data, with optional summarization and trimming, gives you much greater flexibility and robustness for real-world agents.

@WizardMiner I understand your concern, but I think it's based on faulty assumptions about what context represents. I had similar misunderstandings, but after reading the code I realised I was making assumptions that weren't correct about the meaning of `context`. I provided the links to the code so you can read exactly what's going on.

For the input side, either (1) the messages generate the prompt, or (2) the context is decoded into the prompt. The context in this case is just a lossy representation of `messages`. So it's actually worse, basically, and you'll almost certainly always be better off using `messages`.

You can actually validate this yourself (e.g. the branching issue you mention) by using context and/or messages with `temperature: 0` - you will get identical results.
@pd95 commented on GitHub (Jul 26, 2025):
@WizardMiner just don't use ChatGPT to summarize what Ollama does and how it works. Your chat with ChatGPT does not deeply inspect the source and try to understand it. It also seems that it doesn't take into account any of the arguments we have brought to light and documented here in this issue... you should ask it about the source snippets and how they illustrate the inner workings of the Generate method.

ChatGPT has the tendency to just elaborate on the points you want to read. It should take more of the real source of truth into account: the actual implementation, which is open source! We already have a lot of indications of which sources to check.

By all means, use ChatGPT to better understand how Ollama works, but feed it the source. Don't take a discussion thread as its only input!