mirror of https://github.com/ollama/ollama.git synced 2026-05-07 00:22:43 -05:00

Daniel Hiltgen 4fe5609563 metal: harden for ggml initialization failures (#15755)

* metal: harden for ggml initialization failures

ggml_metal_device_init performs a probe to verify that the tensor API compiles. On
some systems the probe passes even though kernel coverage is incomplete, which
results in a later crash when compiling the real kernels. This change adds a
single retry with the tensor API disabled if any of the error strings match this
failure mode. It also hardens an error case in the Go initDevices to detect
device initialization failures and panic instead of crashing later on a nil
array entry.

Fixes #15734

* review comments

* review comments

2026-04-30 16:28:03 -07:00

common

server: add logprobs and top_logprobs support to Ollama's API (#12899)

2025-11-11 08:49:50 -08:00

llamarunner

flash attn: add auto mode for llama engine (#13052)

2025-12-12 13:27:19 -08:00

ollamarunner

metal: harden for ggml initialization failures (#15755)

2026-04-30 16:28:03 -07:00

README.md

…

runner.go

Add MLX runner with GLM4-MoE-Lite model support (#14185)

2026-02-10 14:57:57 -08:00

README.md

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding