[GH-ISSUE #12368] Qwen3-Embedding-0.6B failing since new update #70276

Closed
opened 2026-05-04 20:55:14 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @0Tick on GitHub (Sep 22, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12368

What is the issue?

When trying to create an embedding via the API with the Qwen3 Embedding model, I get an HTTP 500 telling me that the model doesn't support embeddings. This only started after upgrading to the newest version; after reading through the changelog, I assume it is because Qwen models now use the Ollama backend. Either the Ollama backend for Qwen models needs embedding support too, or there should be an option to change the backend via the Modelfile.

```json
{
  "model": "hf.co/yomir/Qwen3-Embedding-0.6B-GGUF:F16",
  "input": "Input text"
}
```

```json
{
  "error": "this model does not support embeddings"
}
```
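For reproducing the failure outside of a raw HTTP client, here is a minimal Python sketch of the same `/api/embed` call (assuming an Ollama server on the default port 11434; the helper names are made up for illustration). It surfaces the server's JSON error body instead of just the 500 status:

```python
import json
import urllib.request
import urllib.error

OLLAMA_URL = "http://localhost:11434/api/embed"  # default Ollama endpoint

def build_payload(model: str, text: str) -> bytes:
    """Encode the /api/embed request body."""
    return json.dumps({"model": model, "input": text}).encode("utf-8")

def embed(model: str, text: str):
    """POST to /api/embed; returns the embeddings on success,
    or the server's error string (e.g. the 500 body above) on failure."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, text),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.load(resp)["embeddings"]
    except urllib.error.HTTPError as e:
        # Ollama returns errors as a JSON body:
        # {"error": "this model does not support embeddings"}
        return json.load(e)["error"]
    except urllib.error.URLError as e:
        return f"connection failed: {e.reason}"

if __name__ == "__main__":
    print(embed("hf.co/yomir/Qwen3-Embedding-0.6B-GGUF:F16", "Input text"))
```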

Relevant log output

```shell
time=2025-09-22T16:08:01.046+02:00 level=INFO source=server.go:399 msg="starting runner" cmd="C:\\Users\\0tick\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\0tick\\.ollama\\models\\blobs\\sha256-25cbcaca565a8e8d0e233656e6098873ada4ff66ce80c34958885ff3f0082800 --port 53871"
time=2025-09-22T16:08:01.058+02:00 level=INFO source=server.go:672 msg="loading model" "model layers"=29 requested=-1
time=2025-09-22T16:08:01.058+02:00 level=INFO source=server.go:678 msg="system memory" total="31.4 GiB" free="9.6 GiB" free_swap="9.9 GiB"
time=2025-09-22T16:08:01.103+02:00 level=INFO source=runner.go:1252 msg="starting ollama engine"
time=2025-09-22T16:08:01.105+02:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:53871"
time=2025-09-22T16:08:01.115+02:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-22T16:08:01.130+02:00 level=INFO source=ggml.go:131 msg="" architecture=qwen3 file_type=F16 name="Qwen3 Embedding 0.6B Bf16" description="" num_tensors=310 num_key_values=28
load_backend: loaded CPU backend from C:\Users\0tick\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-09-22T16:08:01.475+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
time=2025-09-22T16:08:01.479+02:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-22T16:08:01.605+02:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:6 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-22T16:08:01.605+02:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.4 GiB"
time=2025-09-22T16:08:01.605+02:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="448.0 MiB"
time=2025-09-22T16:08:01.605+02:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="140.0 MiB"
time=2025-09-22T16:08:01.605+02:00 level=INFO source=backend.go:342 msg="total memory" size="2.0 GiB"
time=2025-09-22T16:08:01.605+02:00 level=INFO source=ggml.go:487 msg="offloading 0 repeating layers to GPU"
time=2025-09-22T16:08:01.605+02:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
time=2025-09-22T16:08:01.605+02:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-09-22T16:08:01.605+02:00 level=INFO source=ggml.go:491 msg="offloading output layer to CPU"
time=2025-09-22T16:08:01.605+02:00 level=INFO source=ggml.go:498 msg="offloaded 0/29 layers to GPU"
time=2025-09-22T16:08:01.606+02:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-09-22T16:08:02.609+02:00 level=INFO source=server.go:1289 msg="llama runner started in 1.56 seconds"
time=2025-09-22T16:08:02.687+02:00 level=INFO source=server.go:1598 msg="llm embedding error: this model does not support embeddings"
[GIN] 2025/09/22 - 16:08:02 | 500 |    2.2245471s |       127.0.0.1 | POST     "/api/embed"
```

OS

Windows

GPU

Intel

CPU

Intel

Ollama version

0.12.0

GiteaMirror added the bug label 2026-05-04 20:55:14 -05:00

@rick-github commented on GitHub (Sep 22, 2025):

Try pulling the officially supported version, https://hf.co/Qwen/Qwen3-Embedding-0.6B-GGUF:F16.

```console
$ curl -s http://localhost:11434/api/embed -d '{"model":"hf.co/Qwen/Qwen3-Embedding-0.6B-GGUF:F16","input":"Input text"}' | jq -c '.embeddings[]|.[0:3] + ["..."] + .[-3:]'
[-0.064007714,-0.038458604,-0.0072344183,"...",-0.014235251,0.018021174,-0.004399604]
```
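To check whether a pulled model carries the embedding metadata at all, Ollama's `/api/show` endpoint returns the model's GGUF key/value pairs in its `model_info` field. A hedged Python sketch (the exact key name, e.g. `qwen3.pooling_type`, is an assumption based on the `*.pooling_type` pattern mentioned in this thread; `find_pooling_keys` is a hypothetical helper):

```python
import json
import urllib.request
import urllib.error

def find_pooling_keys(model_info: dict) -> dict:
    """Return any '*.pooling_type' entries from /api/show model_info.
    An embedding-capable model is expected to carry one of these keys."""
    return {k: v for k, v in model_info.items() if k.endswith(".pooling_type")}

def show(model: str) -> dict:
    """Fetch model metadata from a local Ollama server via /api/show."""
    req = urllib.request.Request(
        "http://localhost:11434/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    try:
        info = show("hf.co/Qwen/Qwen3-Embedding-0.6B-GGUF:F16").get("model_info", {})
        print(find_pooling_keys(info) or "no pooling_type metadata found")
    except urllib.error.URLError as e:
        print(f"could not reach Ollama: {e.reason}")
```

If the dict comes back empty, the GGUF was exported without pooling metadata, which matches the failure mode described in this issue.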

@mxyng commented on GitHub (Sep 22, 2025):

@rick-github is correct. The model you linked doesn't have the characteristics of an embedding model (it is missing the `*.pooling_type` metadata). While in previous versions of Ollama you would've been able to extract embeddings from it, they would not correctly capture the input the way Qwen3 Embedding 0.6B does in sentence-transformers.
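The pooling mxyng refers to can be sketched concretely. For Qwen3-Embedding-style models, sentence-transformers applies last-token pooling (the hidden state of the final non-padding token, then L2-normalized) rather than returning raw per-token states. A toy, framework-free illustration with made-up numbers (real models operate on tensors of shape `[tokens, hidden_dim]`):

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Last-token pooling: take the hidden state of the
    final non-padding token (mask == 1)."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]

def l2_normalize(vec):
    """Scale a vector to unit length, as embedding pipelines typically do."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Toy per-token hidden states for a 3-token input (dim=2),
# plus one padding position that the mask excludes.
hidden = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]  # last position is padding

emb = l2_normalize(last_token_pool(hidden, mask))
print(emb)  # [0.6, 0.8]
```

Without the `pooling_type` metadata, a runner has no way to know which reduction to apply, so the vectors it could return would not match what the reference sentence-transformers pipeline produces.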


@pdevine commented on GitHub (Sep 24, 2025):

Going to close this as answered. We can reopen if you're still having problems.
