* Reduce startup model hydration
Add a lightweight model list cache for tags and launch inventory, while keeping show cache population lazy. This avoids loading every local model at startup on large model stores.
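As a rough illustration of the split described above (not the actual implementation), the sketch below keeps cheap per-model summaries eagerly and hydrates full details only on first request. All names here (`ModelSummary`, `ModelDetails`, `ModelCache`, `List`, `Show`) are hypothetical and chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// ModelSummary is a hypothetical lightweight record: only what a listing needs.
type ModelSummary struct {
	Name string
	Size int64
}

// ModelDetails stands in for the expensive, fully hydrated model data.
type ModelDetails struct {
	Name     string
	Template string
}

// ModelCache holds summaries eagerly and details lazily.
type ModelCache struct {
	mu        sync.Mutex
	summaries []ModelSummary
	details   map[string]ModelDetails
	load      func(name string) (ModelDetails, error)
	loads     int // number of expensive loads performed, for illustration
}

// NewModelCache builds a cache from precomputed summaries and a loader callback.
func NewModelCache(summaries []ModelSummary, load func(string) (ModelDetails, error)) *ModelCache {
	return &ModelCache{
		summaries: summaries,
		details:   make(map[string]ModelDetails),
		load:      load,
	}
}

// List returns a copy of the summaries without hydrating any model.
func (c *ModelCache) List() []ModelSummary {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]ModelSummary(nil), c.summaries...)
}

// Show hydrates a single model on first request and caches the result.
func (c *ModelCache) Show(name string) (ModelDetails, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if d, ok := c.details[name]; ok {
		return d, nil
	}
	d, err := c.load(name)
	if err != nil {
		return ModelDetails{}, err
	}
	c.loads++
	c.details[name] = d
	return d, nil
}

func main() {
	cache := NewModelCache(
		[]ModelSummary{{Name: "a", Size: 1}, {Name: "b", Size: 2}},
		func(name string) (ModelDetails, error) {
			return ModelDetails{Name: name, Template: "tmpl-" + name}, nil
		},
	)
	fmt.Println("listed:", len(cache.List()), "loads:", cache.loads)
	cache.Show("a")
	cache.Show("a") // second call is served from the details cache
	fmt.Println("loads after two shows:", cache.loads)
}
```

With this shape, startup cost scales with the size of the summary list rather than with the cost of loading every model.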
* harden flaky scheduler unit test
* remove extra launch model metadata text
* review comments
* review comments
* mlx: add laguna model support
* convert: support fp8 safetensors import
Decode HF F8_E4M3 safetensors with block scale companions into GGUF-supported tensor types, and record which output tensors came from FP8 source weights.
Use that source-precision metadata when quantizing during create: FP8-sourced GGUFs default to Q8_0; for Q8_0 requests, non-FP8 tensors keep their original precision; for Q4_K requests, non-FP8 quantizable tensors are promoted to Q8_0.
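The selection rules above can be sketched as a small decision function. This is a hedged illustration, not the code in this change: the `Tensor` type, `effectiveRequest`, `targetType`, and the tensor names are hypothetical, and the handling of FP8-sourced tensors (they simply receive the requested type) is an assumption the description does not spell out.

```go
package main

import "fmt"

// Type names as plain strings, for illustration only.
const (
	Q8_0 = "Q8_0"
	Q4_K = "Q4_K"
)

// Tensor is a hypothetical record of one output tensor after conversion.
type Tensor struct {
	Name        string
	Original    string // precision written by the converter, e.g. "BF16"
	FromFP8     bool   // true if decoded from FP8 source weights
	Quantizable bool   // true if the tensor is eligible for quantization
}

// effectiveRequest applies the default: with no explicit request,
// a model containing FP8-sourced tensors is quantized to Q8_0.
func effectiveRequest(requested string, tensors []Tensor) string {
	if requested != "" {
		return requested
	}
	for _, t := range tensors {
		if t.FromFP8 {
			return Q8_0
		}
	}
	return ""
}

// targetType picks the output type for one tensor under a request.
func targetType(t Tensor, requested string) string {
	switch {
	case requested == "":
		return t.Original // nothing requested and no FP8 default
	case t.FromFP8:
		return requested // assumption: FP8-sourced tensors take the requested type
	case requested == Q4_K && t.Quantizable:
		return Q8_0 // non-FP8 quantizable tensors are promoted to Q8_0
	default:
		return t.Original // Q8_0 requests and non-quantizable tensors keep precision
	}
}

func main() {
	// Tensor names below are made up for the example.
	tensors := []Tensor{
		{Name: "blk.0.ffn_up.weight", Original: "BF16", FromFP8: true, Quantizable: true},
		{Name: "blk.0.attn_q.weight", Original: "BF16", Quantizable: true},
		{Name: "blk.0.attn_norm.weight", Original: "F32"},
	}
	for _, req := range []string{"", Q8_0, Q4_K} {
		eff := effectiveRequest(req, tensors)
		for _, t := range tensors {
			fmt.Printf("request=%q effective=%q %s -> %s\n", req, eff, t.Name, targetType(t, eff))
		}
	}
}
```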
* ggml: add laguna model support
* server: preserve generate logprobs with builtin parsers
Generate requests were dropping logprob-only chunks whenever a builtin parser buffered visible content. Chat already handled this case, but generate only forwarded chunks with visible response, thinking, or tool-call output.
Keep generate chunks that carry logprobs even when the builtin parser has not flushed visible content yet, and add a regression test that exercises the behavior with a generic thinking parser.
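A minimal sketch of the forwarding rule described above, assuming a simplified chunk shape. The `Chunk` and `Logprob` types and the `shouldForward` helper are hypothetical names for illustration and do not mirror the server's real types.

```go
package main

import "fmt"

// Logprob is a hypothetical per-token log probability entry.
type Logprob struct {
	Token   string
	Logprob float64
}

// Chunk is a simplified stand-in for one streamed generate chunk
// after the builtin parser has processed it.
type Chunk struct {
	Response  string
	Thinking  string
	ToolCalls []string
	Logprobs  []Logprob
}

// shouldForward reports whether a chunk should be sent to the client.
// The previous behavior corresponds to the first three conditions only;
// the logprobs condition keeps chunks whose visible content is still buffered.
func shouldForward(c Chunk) bool {
	return c.Response != "" ||
		c.Thinking != "" ||
		len(c.ToolCalls) > 0 ||
		len(c.Logprobs) > 0
}

func main() {
	buffered := Chunk{Logprobs: []Logprob{{Token: "<think>", Logprob: -0.1}}}
	empty := Chunk{}
	fmt.Println(shouldForward(buffered), shouldForward(empty))
}
```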
* review comments - perf improvements
* ggml: implement nemotron 3 nano omni
* add poolside integration
* update poolside doc
* adapt to new cache setup
* fix test
* fix test
---------
Co-authored-by: Eva Ho <hoyyeva@gmail.com>
This is a stop-gap to guide users better for now. More in-depth recommendations per integration will follow.
---------
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>