Currently, both the old and new engines have code to calculate how much
memory a model requires and to lay out its layers onto GPUs. This change
reuses the new engine's layout code for the old engine as well, bringing
them closer together. The old engine continues to use its current method
of estimating required memory.
This reduces maintenance effort and improves consistency, as new
features only need to be implemented in one place. The newer code
is also more accurate, especially with multiple GPUs.
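For a sense of what "laying out layers onto GPUs" means here, a minimal sketch of greedy placement by free VRAM; this is illustrative only and far simpler than the real layout code, which accounts for more than just layer weights:

```go
// Hypothetical sketch of greedy layer placement across GPUs; illustrative
// only, not the actual engine code.
package main

import "fmt"

type gpu struct {
	id   int
	free uint64 // free VRAM in bytes
}

// placeLayers assigns each layer to the first GPU with enough free VRAM,
// falling back to the CPU (id -1) when nothing fits.
func placeLayers(gpus []gpu, layerSizes []uint64) map[int]int {
	placement := make(map[int]int) // layer index -> gpu id
	for i, size := range layerSizes {
		placed := false
		for j := range gpus {
			if gpus[j].free >= size {
				gpus[j].free -= size
				placement[i] = gpus[j].id
				placed = true
				break
			}
		}
		if !placed {
			placement[i] = -1 // offload to CPU
		}
	}
	return placement
}

func main() {
	gpus := []gpu{{id: 0, free: 8 << 30}, {id: 1, free: 4 << 30}}
	layers := []uint64{1 << 30, 1 << 30, 1 << 30}
	fmt.Println(placeLayers(gpus, layers))
}
```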
Adds logprobs support to Ollama's API, including Ollama's OpenAI-compatible
API. By specifying the new 'logprobs' boolean parameter in a request, Ollama
will return the log probabilities for each generated token. An integer
'top_logprobs' parameter, with a value of up to 20, can also be specified;
when set, the API also returns that many of the most likely tokens at each
token position.
Co-authored-by: Baptiste Jamin <baptiste@crisp.chat>
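For example, a request against the /api/generate endpoint using the new parameters might look like the following sketch (the model name is only an example; response handling is omitted):

```go
// Sketch of a request using the new 'logprobs'/'top_logprobs' parameters
// against a local Ollama server.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model":        "llama3.2", // example model name
		"prompt":       "Why is the sky blue?",
		"stream":       false,
		"logprobs":     true, // return log probabilities for each generated token
		"top_logprobs": 5,    // also return the 5 most likely tokens at each position
	})
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```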
* app: add code for macOS and Windows apps under 'app'
* app: add readme
* app: windows and linux only for now
* ci: fix ui CI validation
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
On main, the `RENDERER` and `PARSER` fields from the `Modelfile` don't
get propagated to a new model created with a `req.From` parameter. This
is easily triggered via `ollama run qwen3-coder`, then running some save
command like `/save qwen3-coder-custom`.
Added a regression test for this, and the fix now opens the config for the
"from" model in order to use its renderer/parser as defaults for the new
model. This fixes both the CLI and API-based creates.
Fixes: https://github.com/ollama/ollama/issues/12792
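The shape of the fix, as a hypothetical sketch with illustrative types rather than the actual Ollama internals:

```go
package model

// Illustrative types only; the real Ollama structs differ.
type CreateRequest struct{ Renderer, Parser string }
type ModelConfig struct{ Renderer, Parser string }

// resolveRendererParser falls back to the "from" model's renderer and
// parser when the create request does not override them.
func resolveRendererParser(req CreateRequest, from ModelConfig) (string, string) {
	renderer, parser := req.Renderer, req.Parser
	if renderer == "" {
		renderer = from.Renderer // inherit RENDERER from the "from" model
	}
	if parser == "" {
		parser = from.Parser // inherit PARSER from the "from" model
	}
	return renderer, parser
}
```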
Currently, checking the length of prompts for embeddings to ensure
they fit in the context window (and truncating them if necessary) occurs
in two places - the Ollama server and the runner. This can lead to
inconsistencies in both the checks and the reported number of tokens
processed. Since we have to do this processing in the runner anyway, this
consolidates all of the logic there.
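The consolidated runner-side check amounts to something like the following hypothetical sketch (names are illustrative, not the actual runner code):

```go
package runner

import "fmt"

// checkEmbeddingPrompt tokenizes are assumed done by the caller; given the
// tokens, either truncate to the context window or reject the request.
func checkEmbeddingPrompt(tokens []int, numCtx int, truncate bool) ([]int, error) {
	if len(tokens) <= numCtx {
		return tokens, nil
	}
	if !truncate {
		return nil, fmt.Errorf("input length %d exceeds context length %d", len(tokens), numCtx)
	}
	// Truncate to the context window; the token count reported back to the
	// server is now consistent with what was actually processed.
	return tokens[:numCtx], nil
}
```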
* DRY out the runner lifecycle code
Now that discovery uses the runners as well, this unifies the runner spawning code
into a single place. This also unifies GPU discovery types with the newer ml.DeviceInfo
* win: make incremental builds better
Place build artifacts in discrete directories so incremental builds don't have to start fresh
* Adjust sort order to consider iGPUs
* handle cpu inference oom scenarios
* review comments
* test: harden scheduler tests
This removes reschedDelay, which was stale code, and adds a new
configurable timeout for waitForVRAMRecovery so tests can set the
timeout to be very short and avoid the scheduler getting stuck and
hitting a test timeout (a sketch of this pattern follows after this list).
* test: tune tests for partial loads
Give stress tests more time when the model is split between CPU/GPU
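The configurable-timeout pattern mentioned in the test-hardening bullet above, as a minimal sketch; the names and default value are illustrative, not the actual scheduler code:

```go
package sched

import "time"

type Scheduler struct {
	// vramRecoveryTimeout bounds how long we wait for VRAM to be reported
	// as free again after a runner exits; tests can set this very low.
	vramRecoveryTimeout time.Duration
}

func NewScheduler() *Scheduler {
	return &Scheduler{vramRecoveryTimeout: 5 * time.Second} // illustrative production default
}

// waitForVRAMRecovery polls until VRAM is reported free or the timeout expires.
func (s *Scheduler) waitForVRAMRecovery(freed func() bool) bool {
	deadline := time.Now().Add(s.vramRecoveryTimeout)
	for time.Now().Before(deadline) {
		if freed() {
			return true
		}
		time.Sleep(50 * time.Millisecond)
	}
	return false // give up; with a short timeout, tests reach this quickly
}
```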
Adds a temporary global flag that causes renderers to always render images
as [img]. In a follow-up change, we will consider making this the default,
and this flag could eventually be removed.
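As a rough sketch of the flag described above (renderer internals and names here are illustrative only):

```go
package renderers

// renderImagesAsTag is the temporary global flag; when set, every renderer
// emits the literal [img] placeholder for images.
var renderImagesAsTag = false

func renderImagePlaceholder() string {
	if renderImagesAsTag {
		return "[img]"
	}
	return defaultImageToken() // whatever the model's renderer uses today
}

func defaultImageToken() string { return "<image>" } // illustrative only
```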
* working (other than tool calls being in the incorrect order) for tool calls and tools
* Tests work, other than image tags (tests do not go through server) and tools (not in the correct order, but contents are the same)
* testing for qwen3vl parser - toolparser is working
* made changes to JSON tool parser, wraps the ToolCallFunction with a ToolCall object
* Working parser for thinking models - assumes thinking state, emits unambiguous content while thinking, does not emit tool calls while thinking
* changed the parser to start with collecting content
* thinking prefill
* add hasThinkingSupport parameter to parser
* qwen3-vl -> qwen3-vl-instruct for renderer/parser
* Add hasThinkingSupport=false to QwenVLParser
---------
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
Made it so that when api/generate builds up a message array and generates
the prompt, it now goes through the same function as `api/chat`, for
consistency. This is where we hook in the optional built-in renderers to
bypass templates, which was missing for `api/generate` before this
change.
Closes: #12578
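Conceptually, the change looks something like this hypothetical sketch (names are illustrative, not the actual server code):

```go
package server

type Message struct {
	Role    string
	Content string
}

// chatPrompt stands in for the shared prompt-rendering function used by
// api/chat, where built-in renderers are applied.
func chatPrompt(msgs []Message) (string, error) { /* ... */ return "", nil }

// generatePrompt converts a generate-style request into messages and reuses
// the chat prompt path instead of rendering a template directly.
func generatePrompt(system, prompt string) (string, error) {
	var msgs []Message
	if system != "" {
		msgs = append(msgs, Message{Role: "system", Content: system})
	}
	msgs = append(msgs, Message{Role: "user", Content: prompt})
	return chatPrompt(msgs)
}
```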
* logs: quiet down context canceled on completion
If the client closes the connection before Completion finishes, we were
logging at error level, implying the runner crashed, which was misleading.
time=2025-10-08T22:59:20.566-07:00 level=ERROR source=server.go:1490 msg="post predict" error="Post \"http://127.0.0.1:57736/completion\": context canceled"
* quiet down scheduler log error on expected case
Since we don't hold the lock while performing memory load calculations, other
runners can unload in parallel, so finding no runner to unload is a valid scenario
which we shouldn't log at error level.
This revamps how we discover GPUs in the system by leveraging the Ollama
runner. This should eliminate inconsistency between our GPU discovery and the
runners' capabilities at runtime, particularly for cases where we try to filter
out unsupported GPUs. Now the runner does that implicitly based on the actual
device list. In some cases free VRAM reporting can be unreliable, which can
lead to scheduling mistakes, so this also includes a patch to leverage more
reliable VRAM reporting libraries if available.
Automatic workarounds have been removed, as only one GPU relied on them;
that case is now documented. This GPU will soon fall off the support matrix
with the next ROCm bump.
Additional cleanup of the scheduler and discovery packages can be done in the
future once we have switched on the new memory management code, and removed
support for the llama runner.
* auth: fix problems with the ollama keypairs
This change adds several fixes including:
- reading in the pubkey files correctly
- fixing the push unit test to create a keypair file in a temp directory
- not returning 500 errors for normal status errors
Now that we have a built-in parser abstraction, which was introduced in
<https://github.com/ollama/ollama/pull/12248>, we can modify our harmony
parser to match this and then get rid of nearly all of the
harmony-specific logic in routes.go. We do have a small amount of
code that turns the parser on by default if the architecture matches and
no other built-in parser was provided.
The built-in parser interface was modified in order to handle harmony's
prefill and tool name translation requirements.
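The default-on check amounts to something like the following illustrative sketch (the function and architecture names are assumptions, not the actual routes.go code):

```go
package server

// defaultParserName picks the built-in parser to use: an explicitly
// configured parser always wins; otherwise, a matching architecture turns
// the harmony parser on by default.
func defaultParserName(requested, architecture string) string {
	if requested != "" {
		return requested
	}
	if architecture == "gptoss" { // architecture name is illustrative
		return "harmony"
	}
	return ""
}
```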