[PR #14969] Create safetensors models through a hybrid local/remote pipeline. #46193

Open
opened 2026-04-25 01:42:42 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14969
Author: @dhiltgen
Created: 3/19/2026
Status: 🔄 Open

Base: main ← Head: create_phase2


📝 Commits (1)

  • 86d74dc Create safetensors models through a hybrid local/remote pipeline.

📊 Changes

32 files changed (+3875 additions, -395 deletions)

View changed files

📝 api/client.go (+13 -0)
📝 api/types.go (+11 -0)
📝 cmd/cmd.go (+53 -5)
📝 cmd/cmd_test.go (+27 -0)
📝 convert/reader_safetensors.go (+1 -0)
📝 discover/gpu_info_darwin.m (+12 -4)
📝 docs/api.md (+39 -1)
📝 fs/ggml/gguf.go (+217 -7)
📝 fs/ggml/gguf_test.go (+599 -0)
📝 integration/create_test.go (+57 -16)
📝 progress/bar.go (+4 -0)
📝 server/create.go (+304 -1)
📝 server/images.go (+92 -54)
📝 server/images_test.go (+20 -0)
📝 server/quantization.go (+31 -1)
📝 server/routes_create_test.go (+395 -125)
📝 server/sched_test.go (+4 -4)
➕ x/create/capabilities.go (+119 -0)
📝 x/create/client/create.go (+20 -109)
📝 x/create/client/create_test.go (+142 -12)

...and 12 more files

📄 Description

A local OLLAMA_HOST uses the optimized local create path by default, while non-local hosts use the API upload/create flow. OLLAMA_CREATE_REMOTE can force the remote path for testing, and OLLAMA_CREATE_SERVER_QUANTIZE can force quantization to run on the server during remote creates.
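
As a rough sketch of how this selection could work (the helper names and locality heuristic below are illustrative assumptions, not the PR's actual code):

```go
// Illustrative sketch only: helper names and the locality check are assumptions.
package main

import (
	"fmt"
	"net"
	"os"
)

// useRemoteCreate reports whether the API upload/create flow should be used
// for the given OLLAMA_HOST value.
func useRemoteCreate(host string) bool {
	if os.Getenv("OLLAMA_CREATE_REMOTE") != "" {
		return true // force the remote path for testing
	}
	h, _, err := net.SplitHostPort(host)
	if err != nil {
		h = host // no port present
	}
	ip := net.ParseIP(h)
	isLocal := h == "" || h == "localhost" || (ip != nil && ip.IsLoopback())
	return !isLocal // non-local hosts go through the API upload/create flow
}

// serverSideQuantize reports whether quantization should run on the server
// instead of on the client for remote creates.
func serverSideQuantize() bool {
	return os.Getenv("OLLAMA_CREATE_SERVER_QUANTIZE") != ""
}

func main() {
	fmt.Println(useRemoteCreate("127.0.0.1:11434"), serverSideQuantize())
}
```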

Remote safetensors creation now reuses x/create.CreateSafetensorsModel as the single traversal/import path. This keeps remote behavior aligned with local handling for architecture-specific transforms, source FP8 and prequantized tensors, skipped companion tensors, expert grouping, and per-tensor quantization decisions.
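
Conceptually, both paths drive one shared traversal and differ only in where the produced layers are written. The sketch below illustrates that shape; x/create.CreateSafetensorsModel is the real entry point named above, but the stand-in signature and sink abstraction here are invented for illustration:

```go
// Conceptual sketch, not the PR's implementation.
package main

import (
	"fmt"
	"io"
	"strings"
)

// layerSink abstracts "where a produced layer is written".
type layerSink interface {
	WriteLayer(name string, r io.Reader) error
}

// createSafetensorsModel stands in for x/create.CreateSafetensorsModel: one
// traversal applies transforms, FP8/prequantized handling, expert grouping,
// and per-tensor quantization decisions, then hands each layer to the sink.
func createSafetensorsModel(tensors []string, sink layerSink) error {
	for _, name := range tensors {
		if err := sink.WriteLayer(name, strings.NewReader("tensor bytes")); err != nil {
			return err
		}
	}
	return nil
}

type localStore struct{}

func (localStore) WriteLayer(name string, r io.Reader) error {
	_, err := io.Copy(io.Discard, r) // would write a blob into the local model store
	fmt.Println("stored locally:", name)
	return err
}

type apiUploader struct{}

func (apiUploader) WriteLayer(name string, r io.Reader) error {
	_, err := io.Copy(io.Discard, r) // would upload the blob via the create API
	fmt.Println("uploaded:", name)
	return err
}

func main() {
	tensors := []string{"blk.0.attn_q.weight", "blk.0.ffn_up.weight"}
	_ = createSafetensorsModel(tensors, localStore{})  // local OLLAMA_HOST
	_ = createSafetensorsModel(tensors, apiUploader{}) // non-local host
}
```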

Fix server-side safetensors assembly to preserve Modelfile parameters and FileType metadata, and handle packed tensor groups during server-side quantization without dropping tensors. Packed groups are quantized as multi-tensor blobs instead of attempting to quantize the group name as a single tensor.
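
A minimal sketch of the packed-group idea, assuming hypothetical types and helper names (not the PR's code): every member of the group is quantized and the results are packed into one multi-tensor blob, rather than treating the group name as a single tensor.

```go
// Sketch only: types, names, and the blob layout are assumptions.
package main

import (
	"bytes"
	"fmt"
)

type tensor struct {
	Name string
	Data []byte
}

// quantizePackedGroup quantizes each member tensor and concatenates the
// results, so server-side quantization never drops tensors from the group.
func quantizePackedGroup(group string, members []tensor, quantize func(tensor) ([]byte, error)) ([]byte, error) {
	var blob bytes.Buffer
	for _, t := range members {
		q, err := quantize(t)
		if err != nil {
			return nil, fmt.Errorf("quantize %s in group %s: %w", t.Name, group, err)
		}
		blob.Write(q)
	}
	return blob.Bytes(), nil
}

func main() {
	members := []tensor{{Name: "blk.0.ffn_gate_exps.weight"}, {Name: "blk.0.ffn_up_exps.weight"}}
	blob, _ := quantizePackedGroup("blk.0.ffn_exps", members, func(t tensor) ([]byte, error) {
		return t.Data, nil // placeholder for the real quantizer
	})
	fmt.Println(len(blob))
}
```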

Improve local packed-group creation by streaming from file-backed TensorData for unquantized groups instead of extracting each tensor as a whole byte slice in memory.
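
The streaming idea looks roughly like the following (the member layout struct is an assumption): unquantized members are copied straight from the file-backed source region, so only io.Copy's small internal buffer is resident at any time.

```go
// Minimal sketch of file-backed streaming; layout struct and names are assumptions.
package main

import (
	"io"
	"os"
)

type member struct {
	Offset, Length int64 // byte range of the tensor within the source file
}

// writeUnquantizedGroup streams each member from the source file into dst
// without materializing whole tensors in memory.
func writeUnquantizedGroup(dst io.Writer, src *os.File, members []member) error {
	for _, m := range members {
		sr := io.NewSectionReader(src, m.Offset, m.Length)
		if _, err := io.Copy(dst, sr); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	f, err := os.Open("model-00001-of-00004.safetensors") // hypothetical source shard
	if err != nil {
		return
	}
	defer f.Close()
	_ = writeUnquantizedGroup(io.Discard, f, []member{{Offset: 0, Length: 1 << 20}})
}
```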

Reduce GGUF creation memory by passing through BF16 tensors when no conversion is needed and by making large tensor writes exclusive while still allowing smaller tensor writes to run concurrently.
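
The write-concurrency part of this can be pictured as a readers/writer lock where small writes share the lock and a large write takes it exclusively, bounding how many large buffers are in flight at once; the size cutoff and type below are assumptions for illustration (the BF16 passthrough is a separate change not shown here).

```go
// Sketch only: the threshold and writer type are assumptions.
package main

import (
	"io"
	"os"
	"sync"
)

const largeTensorBytes = 1 << 30 // assumed cutoff for "large"

type ggufWriter struct {
	mu sync.RWMutex
	w  io.WriterAt
}

func (g *ggufWriter) writeTensor(off int64, data []byte) error {
	if int64(len(data)) >= largeTensorBytes {
		g.mu.Lock() // large writes run exclusively
		defer g.mu.Unlock()
	} else {
		g.mu.RLock() // small writes may overlap with one another
		defer g.mu.RUnlock()
	}
	_, err := g.w.WriteAt(data, off)
	return err
}

func main() {
	f, err := os.CreateTemp("", "gguf-sketch-")
	if err != nil {
		return
	}
	defer os.Remove(f.Name())
	w := &ggufWriter{w: f}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = w.writeTensor(int64(i)*8, []byte("tensor00")) // small, concurrent
		}(i)
	}
	wg.Wait()
}
```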

Add focused tests for remote request metadata, server/client quantization selection, packed quantized tensor preservation, local packed streaming, remote upload/create failures, and GGUF write concurrency.

Test results when creating from gemma-4-31b-it

| Flow | Wall Time | CLI Max RSS | Server Max RSS | Notes |
|---|---:|---:|---:|---|
| Safetensors local create, 4-bit quant | 29.78s | 7.4 GiB | N/A | Direct local path. Model size 18.85 GiB; 1195 manifest layers. |
| Safetensors remote create, 4-bit quant | 41.72s | 7.4 GiB | 68 MiB | Forced remote path with client-side quantization. Model size 18.85 GiB; 1195 manifest layers. |
| Safetensors remote create, 4-bit quant, server-side | 105.38s | 53 MiB | 7.46 GiB | Forced remote path with server-side quantization. Model size 18.85 GiB; 1195 manifest layers. |
| GGUF 31B, q4_k_m, dynamic memory budget final | 220.63s CLI / 2m50s server create | 47.0 MiB | 42.0 GiB | CPU saturated, no swaps, uses subset of free memory budget |
| GGUF create, 4-bit quant, unmodified `main` | - | - | 72.05 GiB | Upstream baseline for the same 31B model and q4_k_m; peak memory footprint was 79.84 GiB. |

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 01:42:42 -05:00
Reference: github-starred/ollama#46193