[GH-ISSUE #12426] Cannot run finetuned embeddinggemma GGUF model #70312

Open
opened 2026-05-04 21:04:31 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @blackccpie on GitHub (Sep 26, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12426

What is the issue?

I'm having trouble running a custom embeddinggemma GGUF model that I finetuned myself and converted to GGUF using the latest llama.cpp convert_hf_to_gguf.py script.
I can import the model correctly, but when trying to invoke it I get an error saying the model architecture gemma-embedding is not supported.
When checking the base embeddinggemma model from Ollama, its architecture is set to gemma3.

Is there any plan to align architectures between llama.cpp and Ollama regarding embeddinggemma?
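
For reference, the architecture string baked into a GGUF file can be checked directly with the gguf Python package that ships with llama.cpp. A minimal sketch, with a placeholder file name:

```python
# Minimal sketch: read the general.architecture key from a GGUF file using the
# gguf package bundled with llama.cpp. The file name is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("ft_embeddinggemma.gguf")
field = reader.fields["general.architecture"]
# for string-typed fields, data[0] indexes the raw UTF-8 bytes in parts
print(bytes(field.parts[field.data[0]]).decode("utf-8"))
# the finetuned conversion reports 'gemma-embedding', while Ollama's base
# embeddinggemma reports 'gemma3'
```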

Relevant log output

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma-embedding'

OS

Ubuntu

GPU

Quadro RTX3000

CPU

Core i7

Ollama version

0.12.2

GiteaMirror added the bug and needs more info labels 2026-05-04 21:04:32 -05:00
Author
Owner

@pdevine commented on GitHub (Sep 26, 2025):

The implementations in llama.cpp and ollama are different from each other, so it's hard to keep the names aligned. Can you edit the conversion script to output the correct name? You should also check that the other kv names match.
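
One way to do that kv check (a sketch using the same gguf package; both file names are placeholders) is to diff the metadata keys of the finetuned GGUF against a known-good base GGUF:

```python
# Sketch: compare GGUF key/value metadata between two files so that missing or
# mismatched keys stand out before importing into Ollama. Paths are placeholders.
from gguf import GGUFReader

def kv_names(path: str) -> set[str]:
    return set(GGUFReader(path).fields.keys())

mine = kv_names("ft_embeddinggemma.gguf")
base = kv_names("embeddinggemma-base.gguf")

print("only in finetuned model:", sorted(mine - base))
print("only in base model:     ", sorted(base - mine))
```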

Author
Owner

@blackccpie commented on GitHub (Sep 30, 2025):

@pdevine I tried to patch the conversion script, successfully generated the gguf file and pulled it into ollama, but when requesting an embed I get the following ollama error in the logs:

ollama[302]: time=2025-09-29T17:53:30.688+02:00 level=INFO source=ggml.go:131 msg="" architecture=gemma3 file_type=F16 name="Ft_Embeddinggemma 300m" description="" num_tensors=314 num_key_values=38
ollama[302]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
ollama[302]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ollama[302]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama[302]: ggml_cuda_init: found 1 CUDA devices:
ollama[302]: Device 0: Quadro RTX 3000, compute capability 7.5, VMM: yes, ID: GPU-051c42d4-3e4d-0e77-4ade-7b73f93b290a
ollama[302]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
ollama[302]: time=2025-09-29T17:53:31.822+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
ollama[302]: time=2025-09-29T17:53:31.832+02:00 level=INFO source=server.go:3634 msg="http: panic serving 127.0.0.1:39384: runtime error: invalid memory address or nil pointer dereference\ngoroutine 40 [running]:\nnet/http.(*conn).serve.func1()\n\tnet/http/server.go:1947 +0xbe\npanic({0x63b06a3e2780?, 0x63b06ad37430?})\n\truntime/panic.go:787 +0x132\ngithub.com/ollama/ollama/runner/ollamarunner.(*Server).allocModel.func1()\n\tgithub.com/ollama/ollama/runner/ollamarunner/runner.go:1091 +0x11a\npanic({0x63b06a3e2780?, 0x63b06ad37430?})\n\truntime/panic.go:787 +0x132\ngithub.com/ollama/ollama/ml/nn.(*Linear).Forward(0x0, {0x63b06a5500a0, 0xc000c33bc0}, {0x63b06a55ad08?, 0xc000c6e828?})\n\tgithub.com/ollama/ollama/ml/nn/linear.go:11 +0x27\ngithub.com/ollama/ollama/model/models/gemma3.(*embedModel).Forward(0xc000fe7900, {0x63b06a5500a0, 0xc000c33bc0}, {{0x63b06a55ad08, 0xc000c406d8}, {0x63b06a55ad08, 0xc000c406f0}, {0xc0004d3000, 0x200, 0x200}, ...})\n\tgithub.com/ollama/ollama/model/models/gemma3/embed.go:27 +0xf3\ngithub.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc000224f00)\n\tgithub.com/ollama/ollama/runner/ollamarunner/runner.go:1062 +0xc0d\ngithub.com/ollama/ollama/runner/ollamarunner.(*Server).allocModel(0xc000224f00, {0x7ffec1a7659f?, 0x63b06937143a?}, {0x0, 0x6, {0xc0004c05d0, 0x1, 0x1}, 0x0}, {0x0, ...}, ...)\n\tgithub.com/ollama/ollama/runner/ollamarunner/runner.go:1124 +0x2ac\ngithub.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc000224f00, {0x63b06a544748, 0xc0004d0000}, 0xc0004c2000)\n\tgithub.com/ollama/ollama/runner/ollamarunner/runner.go:1198 +0x54d\nnet/http.HandlerFunc.ServeHTTP(0xc000488fc0?, {0x63b06a544748?, 0xc0004d0000?}, 0xc0000efb60?)\n\tnet/http/server.go:2294 +0x29\nnet/http.(*ServeMux).ServeHTTP(0x63b069021d85?, {0x63b06a544748, 0xc0004d0000}, 0xc0004c2000)\n\tnet/http/server.go:2822 +0x1c4\nnet/http.serverHandler.ServeHTTP({0x63b06a540d90?}, {0x63b06a544748?, 0xc0004d0000?}, 0x1?)\n\tnet/http/server.go:3301 +0x8e\nnet/http.(*conn).serve(0xc0001443f0, {0x63b06a546a78, 0xc00016e060})\n\tnet/http/server.go:2102 +0x625\ncreated by net/http.(*Server).Serve in goroutine 1\n\tnet/http/server.go:3454 +0x485"
ollama[302]: time=2025-09-29T17:53:31.833+02:00 level=INFO source=runner.go:1171 msg=load request="{Operation:close LoraPath:[] Parallel:0 BatchSize:0 FlashAttention:false KvSize:0 KvCacheType: NumThreads:0 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ollama[302]: time=2025-09-29T17:53:31.834+02:00 level=INFO source=sched.go:438 msg="Load failed" model=/usr/share/ollama/.ollama/models/blobs/sha256-5f21b784b5309c953927bfa751b541377b44df529cadd0cc795a9567d7c732dd error="do load request: Post "http://127.0.0.1:41397/load": EOF"
ollama[302]: time=2025-09-29T17:53:31.848+02:00 level=ERROR source=server.go:425 msg="llama runner terminated" error="signal: killed"

I also checked the model using the llama-embed tool (I also had to patch the llama-arch.cpp file to avoid confusion with the gemma3 conversational model arch), and the output looks correct (a single embedding vector).

So for now I have no straightforward solution... the other option would be patching the gguf metadata directly, but that means more work to build a script for it.

Any idea how I could investigate the ollama crash, or unit-test the llama.cpp-based embedder in Ollama?
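
For reference, the request that triggers the panic above is just a bare call to the /api/embed endpoint, so it can be reproduced without any client library. A minimal sketch, with a placeholder model tag:

```python
# Minimal sketch: reproduce the failing embed request against a local Ollama
# server (default port). The model tag "ft-embeddinggemma" is a placeholder.
import json
import urllib.request

req = urllib.request.Request(
    "http://127.0.0.1:11434/api/embed",
    data=json.dumps({"model": "ft-embeddinggemma", "input": "hello world"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(len(body["embeddings"][0]), body["embeddings"][0][:4])
```

Running the server with OLLAMA_DEBUG=1 also makes the runner logs more verbose, which helps when chasing the panic.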

Author
Owner

@pdevine commented on GitHub (Sep 30, 2025):

OK, I think I may have figured this out. The problem is that llama.cpp doesn't actually implement the model correctly (specifically the MRL, or "Matryoshka Representation Learning", part), and there's a panic because we're trying to apply the dense tensors for MRL, whereas the llama.cpp converter just throws them out.

If you look at the output of ollama show -v embeddinggemma you can see the tensors at the top:

...
  Tensors
    dense.0.weight                       BF16    [768 3072]
    dense.1.weight                       BF16    [3072 768]
    token_embd.weight                    BF16    [768 262144]
    blk.0.attn_norm.weight               F32     [768]
    blk.0.ffn_down.weight                BF16    [1152 768]
    blk.0.ffn_gate.weight                BF16    [768 1152]
...

I think you'd have to patch the convert_hf_to_gguf.py script to include those tensors from your fused, fine-tuned safetensors. Ideally we'd just fix ollama create to import the safetensors directly, but I think the gemma3 converter is missing some pieces for embeddinggemma right now.
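
A rough sketch of what that patch might look like, following the general modify_tensors structure in convert_hf_to_gguf.py; the source tensor names for the two dense heads are assumptions and would need to be checked against the actual safetensors:

```python
# Hypothetical sketch: keep the two MRL dense projections when converting,
# instead of dropping them. The source names ("2_Dense/linear.weight",
# "3_Dense/linear.weight") are assumptions about the safetensors layout; the
# target names match the `ollama show -v` output above. In convert_hf_to_gguf.py
# this logic would live in the model class's modify_tensors() method.
DENSE_MAP = {
    "2_Dense/linear.weight": "dense.0.weight",
    "3_Dense/linear.weight": "dense.1.weight",
}

def modify_tensors(self, data_torch, name, bid):
    if name in DENSE_MAP:
        # emit the dense heads instead of discarding them
        return [(DENSE_MAP[name], data_torch)]
    # everything else goes through the converter's normal name mapping
    return [(self.map_tensor_name(name), data_torch)]
```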

Author
Owner

@blackccpie commented on GitHub (Oct 1, 2025):

@pdevine thanks for the hints, I'll try to dig in that direction! (And yes, indeed my fine-tuned gguf is missing the first two dense tensors when dumping the arch in ollama.)

Author
Owner

@blackccpie commented on GitHub (Oct 2, 2025):

@pdevine out of curiosity, I was wondering how the actual Ollama embeddinggemma gguf was generated?

In fact, I tried importing my model into Ollama directly from the HF directory (using ollama create; I had to patch my config.json to replace the architecture "Gemma3TextModel" with "Gemma3ForCausalLM"), and the imported model is also missing the two initial dense tensor layers, as you anticipated in your previous message.

(For information, the "official" embeddinggemma from ggml-org (https://huggingface.co/ggml-org/embeddinggemma-300M-GGUF) is also missing these two tensors when imported.)
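
For reference, the config.json change mentioned above boils down to something like this (a sketch; the directory is a placeholder and the exact key layout depends on the checkpoint):

```python
# Sketch: point the HF config at the architecture name ollama create expects.
# The directory name is a placeholder.
import json
import pathlib

cfg_path = pathlib.Path("ft_embeddinggemma/config.json")
cfg = json.loads(cfg_path.read_text())
cfg["architectures"] = ["Gemma3ForCausalLM"]  # was ["Gemma3TextModel"]
cfg_path.write_text(json.dumps(cfg, indent=2))
```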

Author
Owner

@pdevine commented on GitHub (Oct 2, 2025):

@blackccpie something we hacked up. There are a couple of PRs we need to get in to fix create to work correctly; I actually didn't anticipate that someone would want to fine-tune this model given how the MRL tensors work (and in fact, I'm not sure whether you have to thaw those tensors during the fine-tune, or how the backward pass is supposed to work).
