[GH-ISSUE #9944] Allow BF16 and F32 model import from tensors files without F16 conversion #32269

Closed
opened 2026-04-22 13:22:31 -05:00 by GiteaMirror · 6 comments

Originally created by @rjmalagon on GitHub (Mar 22, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9944

Originally assigned to: @pdevine on GitHub.

When creating a model from BF16 tensors with `ollama create`, the import defaults to F16 conversion even if the `-q bf16` switch is provided: it first converts to F16 at load time and then quantizes that F16 model to BF16. While this is technically consistent with what `-q` means, there is no proper switch to import BF16 or F32 tensors without the intermediate F16 conversion.

Many models distributed as tensors/safetensors already come in BF16 (or F32), so forcing them through an F16 intermediate is something of a downgrade when BF16 (or pure F32) output is intended.
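
For context on why the round trip matters (a minimal sketch, not ollama code): BF16 and F16 are both 16-bit formats, but BF16 keeps float32's 8-bit exponent while F16 has only a 5-bit exponent with a maximum finite value of about 65504, so a BF16 → F16 step can clamp or lose values that BF16 represents without trouble. The Go snippet below illustrates only the range issue; the helper names are made up for the example.

```go
package main

import (
	"fmt"
	"math"
)

// toBF16 truncates a float32 to bfloat16 precision
// (sign + 8-bit exponent + top 7 mantissa bits survive).
func toBF16(f float32) float32 {
	bits := math.Float32bits(f) &^ 0xFFFF // drop the low 16 mantissa bits
	return math.Float32frombits(bits)
}

// toF16Range models only the range limit of IEEE half precision:
// magnitudes beyond ~65504 saturate to +/-Inf (mantissa rounding omitted).
func toF16Range(f float32) float32 {
	const maxF16 = 65504.0
	switch {
	case f > maxF16:
		return float32(math.Inf(1))
	case f < -maxF16:
		return float32(math.Inf(-1))
	}
	return f
}

func main() {
	v := float32(1e9) // representable in BF16, far outside the F16 range
	fmt.Println("bf16 keeps it:       ", toBF16(v))     // ~1e9, small relative error
	fmt.Println("f16 round trip gives:", toF16Range(v)) // +Inf
}
```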

```
ollama create -f Modelfile-gemma3 -q bf16 gemma-3:12-it-bf16
gathering model components 
copying file sha256:50b2f405ba56a26d4913fd772089992252d7f942123cc0a034d96424221ba946 100% 
copying file sha256:788cc42a1a92835df62d9a3791f47105f63504c7c404637a73288e9b11bc7b82 100% 
copying file sha256:bfe25c2735e395407beb78456ea9a6984a1f00d8c16fa04a8b75f2a614cf53e1 100% 
copying file sha256:ed14bd4908c98fed9f61e8cd410167e0846de9abd78e0452ab092072e5d9252d 100% 
copying file sha256:f688d6bb20c5017601c4011de7ca656da8485b540b05013efdaf986c0fcc918d 100% 
copying file sha256:3ffd5f11778dc73e2b69b3c00535e4121e1badf7018136263cd17b5b34fbaa53 100% 
copying file sha256:2f7b0adf4fb469770bb1490e3e35df87b1dc578246c5e7e6fc76ecf33213a397 100% 
copying file sha256:4667f2089529e8e7657cfb6d1c19910ae71ff5f28aa7ab2ff2763330affad795 100% 
copying file sha256:4847447e92599833e8dbaa3067cd201c3bb5c052efa91f11ba891e43234f7832 100% 
copying file sha256:891bd54eed03cba9ee1e705533a02a8217fcc29f356e4a1f53e5fd0d178883ad 100% 
copying file sha256:7cee411d9d57324e50ce064a192cc5a858276d508611b12fc599e0c9767112e0 100% 
copying file sha256:8bc75a29a730c9e743cad013feda3b0991a913fafe787c58a1c6e20afad97723 100% 
copying file sha256:fe16baf728db49457cde32802cd7efc0ac8a7a9877dbe22fe3322b2d9dc6ccd9 100% 
copying file sha256:39172c4124d3470341bbbb25f2926fd97edf68f0fe3a9fa4cde6acb9b7ed2cc6 100% 
copying file sha256:fd9324becc53c4be610db39e13a613006f09fd6ef71a95fb6320dc33157490a3 100% 
copying file sha256:1299c11d7cf632ef3b4e11937501358ada021bbdf7c47638d13c0ee982f2e79c 100% 
converting model 
quantizing F16 model to BF16
creating new layer sha256:52201f498c01049fdbce5e05094db5b868ff23704baf2571cf4d071967b51920 
using existing layer sha256:e0a42594d802e5d31cdc786deb4823edb8adff66094d49de8fffe976d753e348 
using existing layer sha256:dd084c7d92a3c1c14cc09ae77153b903fd2024b64a100a0cc8ec9316063d2dbc 
using existing layer sha256:d3a76cb8c4a07d0a6c82ac6e839f98816b5077699d393b2cc77008c16d8078ac 
writing manifest 
success
```

As a note, the same model converted with llama.cpp's convert_hf_to_gguf.py script maps torch.bfloat16 tensors to F32/BF16:

```
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> BF16, shape = {5120, 131072}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> BF16, shape = {32768, 5120}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> BF16, shape = {5120, 32768}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> BF16, shape = {5120, 32768}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.bfloat16 --> BF16, shape = {5120, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.bfloat16 --> BF16, shape = {4096, 5120}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.bfloat16 --> BF16, shape = {5120, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight,         torch.bfloat16 --> BF16, shape = {5120, 1024}
```
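
The pattern in that log is simple: 1-D tensors (the norms) are promoted to F32, while 2-D weight matrices keep their source BF16. A minimal sketch of that selection rule in Go (illustrative only, not llama.cpp's or ollama's actual code):

```go
package main

import "fmt"

// chooseTargetType mirrors the rule visible in the log above: promote 1-D
// tensors (norms, biases) to F32 and keep 2-D weights in their source type.
func chooseTargetType(dims int, srcType string) string {
	if dims == 1 {
		return "F32"
	}
	return srcType // e.g. BF16 stays BF16
}

func main() {
	fmt.Println(chooseTargetType(1, "BF16")) // F32, like blk.0.attn_norm.weight
	fmt.Println(chooseTargetType(2, "BF16")) // BF16, like blk.0.attn_q.weight
}
```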

I already tried to modify the reader_safetensors.go file, but with my lack of Go dev skills I only managed to butcher the routine.

GiteaMirror added the feature request label 2026-04-22 13:22:31 -05:00

@rjmalagon commented on GitHub (Mar 23, 2025):

I see this line 258 in server/create.go: https://github.com/ollama/ollama/blob/ce929984a33230269905e0e3cfa335cb8d6ba781/server/create.go#L258
Is this the only part that needs to change to select a higher precision on safetensors import?

It would be nice to autodetect the source model's precision and convert accordingly, or to give the user a switch to select the desired intermediate conversion precision.
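
For illustration, the switch being asked for could map the requested `-q` value to the intermediate conversion type instead of always going through F16. A hypothetical sketch (all names are made up; this is not ollama's actual create path):

```go
package main

import "fmt"

// convertTarget is a hypothetical mapping from the user's -q value to the
// intermediate conversion type, rather than always converting to F16 first.
func convertTarget(requested string) string {
	switch requested {
	case "bf16":
		return "BF16" // import BF16 tensors directly, no F16 round trip
	case "f32", "fp32":
		return "F32"
	default:
		return "F16" // current default behavior
	}
}

func main() {
	fmt.Println(convertTarget("bf16")) // BF16
	fmt.Println(convertTarget(""))     // F16
}
```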


@rjmalagon commented on GitHub (Mar 24, 2025):

Well, I already found out how to get an F32 conversion before quantization.
This line can be changed to force F32 for all tensors; it's ugly and hacky, but it works. The best I can do as a Golang muggle.

https://github.com/ollama/ollama/blob/ce929984a33230269905e0e3cfa335cb8d6ba781/convert/reader.go#L50
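
Roughly what that hack amounts to (an illustrative sketch, not the actual contents of convert/reader.go): pick F32 instead of F16 as the per-tensor conversion target, so nothing is lost before the `-q` quantization step, at the cost of a roughly 2x larger intermediate model.

```go
package main

import "fmt"

// targetType sketches the forced-F32 hack: every tensor is converted to F32
// so no precision is lost before quantization. Illustrative names only;
// this is not the real convert/reader.go code.
func targetType() string {
	// return "F16" // original default behavior
	return "F32" // forced full precision
}

func main() {
	fmt.Println("conversion target:", targetType())
}
```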


@pdevine commented on GitHub (Mar 24, 2025):

This is definitely something I'd like to get done soon. I actually started a partial implementation the other day and it ended up being a bigger change than I was anticipating. It's easy enough to hack in the BF16 conversion part, but the BF16 backend isn't actually compiled in yet so you can't actually run the weights after you've converted them.


@rjmalagon commented on GitHub (Mar 25, 2025):

Thanks for the attention, @pdevine. I will wait for it without making too much fuss.
I'm one of those who work with small models at high precision, while many run big models at limited precision. These small improvements are welcome, even when they come at their own pace.


@rjmalagon commented on GitHub (Jun 14, 2025):

@pdevine I will reconsider my position of quiet waiting on BF16 support for conversion (and for the new Ollama engine) and be vocal enough to keep this feature request from going cold. I arrived at this conclusion after evaluating the impact on accuracy for medium/small models on constrained hardware, testing FP32, BF16, and the Q8 mix of Unsloth dynamic quants.

If I can't help hands-on (I am a Go 'squib'), is there another way to help the Ollama team with this?


@york-cmd commented on GitHub (Jun 14, 2025):

This is an automatic vacation reply from QQ Mail.

Hello, I am currently on vacation and cannot reply to your email in person. I will get back to you as soon as possible once the vacation is over.


Reference: github-starred/ollama#32269