[GH-ISSUE #10811] decode: cannot decode batches with this context (use llama_encode() instead) #69159

Closed
opened 2026-05-04 17:18:36 -05:00 by GiteaMirror · 42 comments

Originally created by @Mihai-CMM on GitHub (May 22, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10811

Hi,
Can you please provide some guidance on how I can fix this?

```
[GIN] 2025/05/22 - 08:11:35 | 200 | 25.004956ms | 192.168.67.41 | POST "/api/embed"
decode: cannot decode batches with this context (use llama_encode() instead)
```
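
For context, a minimal reproduction sketch of the failing call, assuming a local Ollama server on the default port and an embedding model such as nomic-embed-text already pulled (the model name here is illustrative; the thread reports the same error for several models):

```python
# Minimal sketch: send the same kind of request that produced the
# GIN log line above. Assumes Ollama on localhost:11434 and the
# nomic-embed-text model already pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/embed",
    json={"model": "nomic-embed-text", "input": "test sentence"},
)
print(resp.status_code, resp.json())
```

Note that the HTTP response can still be 200 (as in the GIN line above) while the decode message appears only in the server log.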

@rick-github commented on GitHub (May 22, 2025):

Model? Input? Logs?

@Mihai-CMM commented on GitHub (May 23, 2025):

Hello, thanks for looking at it. As mentioned in the other ticket, I think the problem is more general. In my case I've used different embedding models, all returning the same message, though to be honest the data does land in Qdrant. In the logs I don't see anything except that message repeating forever after each /embed call. I'm also using the latest Docker images, with the same result. If it makes any difference, I have Ollama plus Open WebUI installed via the Open WebUI Helm chart. Even if I set OLLAMA_DEBUG=true I don't see any other relevant logs. So this should be reproducible, unless I did something wrong in my config, hence my initial question.
Thank you

@tsly123 commented on GitHub (May 23, 2025):

@Mihai-CMM
Have you tried other embedding models? In my case, I use:

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embedding = embedder.encode(chunk)  # chunk is the text to embed
```

and the results are much better compared to using
`ollama.embeddings(model="nomic-embed-text:latest", prompt=chunk)["embedding"]`
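
For what it's worth, a side-by-side sketch of the two approaches (assuming both packages are installed and an Ollama server is running locally; the sample text is illustrative):

```python
import ollama
from sentence_transformers import SentenceTransformer

chunk = "example text to embed"

# Embed the same chunk with both backends.
st_vec = SentenceTransformer("all-MiniLM-L6-v2").encode(chunk)
ol_vec = ollama.embeddings(model="nomic-embed-text:latest", prompt=chunk)["embedding"]

# The models use different dimensions (384 vs. 768), so compare
# retrieval quality downstream rather than the raw vectors.
print(len(st_vec), len(ol_vec))
```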

@Mihai-CMM commented on GitHub (May 23, 2025):

No, I have an Open WebUI + Ollama bundle, and I'm not really using the CLI or a programmatic approach, since I build the infrastructure and am evaluating what can be used at the moment. I would expect the embedding model to work "out of the box", but you raise a valid point: maybe in my case the client used by Open WebUI is the issue. For example, if I use the "integrated" Open WebUI model all-MiniLM-L6-v2 I see no errors, and as you mention it seems to provide better results, but to my understanding it's not using Ollama. If I use Ollama as the embedding engine, [nomic-embed-text](https://ollama.com/library/nomic-embed-text) and [mxbai-embed-large](https://ollama.com/library/mxbai-embed-large) both return the same errors, but as I've mentioned, I can still see the data in Qdrant.
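
One way to narrow down whether Ollama itself or the Open WebUI client is at fault is to call the models directly with the Python client; a minimal sketch, assuming both models are already pulled and the server is reachable on the default local port:

```python
import ollama

# Call each embedding model directly, bypassing Open WebUI, and
# check whether the server log prints the decode error for both.
for model in ("nomic-embed-text", "mxbai-embed-large"):
    vec = ollama.embeddings(model=model, prompt="test sentence")["embedding"]
    print(model, len(vec))
```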

@rick-github commented on GitHub (May 23, 2025):

> As mentioned in the other ticket, I think the problem is more general.

Which other ticket?

@Mihai-CMM commented on GitHub (May 23, 2025):

Please pardon my English; what I meant is:
there was a reply to your first response in "Ollama + bge-m3 model throws decode error when used as vectorizer with Weaviate: 'cannot decode batches with this context'" (weaviate/weaviate#8237).

What I wanted to add is that this seems to be a more general issue, since I tried more embedding models, and I have no clue if it's a real issue, since I do have data in RAG.

@hillar commented on GitHub (May 24, 2025):

% OLLAMA_DEBUG=1 ollama serve

time=2025-05-24T12:32:14.603+03:00 level=INFO source=routes.go:1205 msg="server config" env="map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/hillar/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false http_proxy: https_proxy: no_proxy:]"
time=2025-05-24T12:32:14.609+03:00 level=INFO source=images.go:463 msg="total blobs: 34"
time=2025-05-24T12:32:14.609+03:00 level=INFO source=images.go:470 msg="total unused blobs removed: 0"
time=2025-05-24T12:32:14.610+03:00 level=INFO source=routes.go:1258 msg="Listening on 127.0.0.1:11434 (version 0.7.1)"
time=2025-05-24T12:32:14.610+03:00 level=DEBUG source=sched.go:108 msg="starting llm scheduler"
time=2025-05-24T12:32:14.639+03:00 level=INFO source=types.go:130 msg="inference compute" id=0 library=metal variant="" compute="" driver=0.0 name="" total="16.0 GiB" available="16.0 GiB"
time=2025-05-24T12:32:23.945+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-05-24T12:32:23.945+03:00 level=DEBUG source=sched.go:185 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-05-24T12:32:23.961+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-05-24T12:32:23.976+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-05-24T12:32:23.976+03:00 level=DEBUG source=sched.go:228 msg="loading first model" model=/Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c
time=2025-05-24T12:32:23.976+03:00 level=DEBUG source=memory.go:111 msg=evaluating library=metal gpu_count=1 available="[16.0 GiB]"
time=2025-05-24T12:32:23.977+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=bert.vision.block_count default=0
time=2025-05-24T12:32:23.977+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-05-24T12:32:23.977+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=bert.attention.key_length default=64
time=2025-05-24T12:32:23.977+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=bert.attention.value_length default=64
time=2025-05-24T12:32:23.977+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-05-24T12:32:23.977+03:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c gpu=0 parallel=1 available=17179885568 required="1.7 GiB"
time=2025-05-24T12:32:23.977+03:00 level=INFO source=server.go:135 msg="system memory" total="24.0 GiB" free="5.6 GiB" free_swap="0 B"
time=2025-05-24T12:32:23.977+03:00 level=DEBUG source=memory.go:111 msg=evaluating library=metal gpu_count=1 available="[16.0 GiB]"
time=2025-05-24T12:32:23.977+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=bert.vision.block_count default=0
time=2025-05-24T12:32:23.977+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-05-24T12:32:23.977+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=bert.attention.key_length default=64
time=2025-05-24T12:32:23.977+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=bert.attention.value_length default=64
time=2025-05-24T12:32:23.977+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-05-24T12:32:23.977+03:00 level=INFO source=server.go:168 msg=offload library=metal layers.requested=-1 layers.model=25 layers.offload=25 layers.split="" memory.available="[16.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.7 GiB" memory.required.partial="1.7 GiB" memory.required.kv="48.0 MiB" memory.required.allocations="[1.7 GiB]" memory.weights.total="1.0 GiB" memory.weights.repeating="577.2 MiB" memory.weights.nonrepeating="488.3 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
time=2025-05-24T12:32:23.978+03:00 level=DEBUG source=server.go:284 msg="compatible gpu libraries" compatible=[]
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) - 16383 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 389 tensors from /Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 567M
llama_model_loader: - kv   3:                            general.license str              = mit
llama_model_loader: - kv   4:                               general.tags arr[str,4]       = ["sentence-transformers", "feature-ex...
llama_model_loader: - kv   5:                           bert.block_count u32              = 24
llama_model_loader: - kv   6:                        bert.context_length u32              = 8192
llama_model_loader: - kv   7:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv   8:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv   9:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv  10:          bert.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:                      bert.attention.causal bool             = false
llama_model_loader: - kv  13:                          bert.pooling_type u32              = 2
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = t5
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,250002]  = ["<s>", "<pad>", "</s>", "<unk>", ","...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,250002]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,250002]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  20:            tokenizer.ggml.token_type_count u32              = 1
llama_model_loader: - kv  21:    tokenizer.ggml.remove_extra_whitespaces bool             = true
llama_model_loader: - kv  22:        tokenizer.ggml.precompiled_charsmap arr[u8,237539]   = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  25:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  26:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  28:                tokenizer.ggml.cls_token_id u32              = 0
llama_model_loader: - kv  29:               tokenizer.ggml.mask_token_id u32              = 250001
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  244 tensors
llama_model_loader: - type  f16:  145 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 1.07 GiB (16.25 BPW) 
init_tokenizer: initializing tokenizer for type 4
load: model vocab missing newline token, using special_pad_id instead
load: control token:      0 '<s>' is not marked as EOG
load: control token:      2 '</s>' is not marked as EOG
load: control token:      1 '<pad>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 4
load: token to piece cache size = 2.1668 MB
print_info: arch             = bert
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 566.70 M
print_info: general.name     = n/a
print_info: vocab type       = UGM
print_info: n_vocab          = 250002
print_info: n_merges         = 0
print_info: BOS token        = 0 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 3 '<unk>'
print_info: SEP token        = 2 '</s>'
print_info: PAD token        = 1 '<pad>'
print_info: MASK token       = 250001 '[PAD250000]'
print_info: LF token         = 0 '<s>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
llama_model_load: vocab only - skipping tensors
time=2025-05-24T12:32:24.191+03:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c --ctx-size 8192 --batch-size 512 --n-gpu-layers 25 --threads 10 --parallel 1 --port 55817"
time=2025-05-24T12:32:24.191+03:00 level=DEBUG source=server.go:432 msg=subprocess PATH=/opt/homebrew/bin:/opt/homebrew/sbin:/Users/hillar/.bun/bin:/Users/hillar/.cargo/bin:/opt/local/bin:/opt/local/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Library/Apple/usr/bin:/opt/podman/bin OLLAMA_DEBUG=1 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources DYLD_LIBRARY_PATH=/Applications/Ollama.app/Contents/Resources:/Applications/Ollama.app/Contents/Resources
time=2025-05-24T12:32:24.196+03:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-05-24T12:32:24.196+03:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-24T12:32:24.197+03:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-24T12:32:24.215+03:00 level=INFO source=runner.go:815 msg="starting go runner"
time=2025-05-24T12:32:24.215+03:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/Applications/Ollama.app/Contents/Resources
time=2025-05-24T12:32:24.218+03:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-05-24T12:32:24.222+03:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:55817"
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) - 16383 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 389 tensors from /Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 567M
llama_model_loader: - kv   3:                            general.license str              = mit
llama_model_loader: - kv   4:                               general.tags arr[str,4]       = ["sentence-transformers", "feature-ex...
llama_model_loader: - kv   5:                           bert.block_count u32              = 24
llama_model_loader: - kv   6:                        bert.context_length u32              = 8192
llama_model_loader: - kv   7:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv   8:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv   9:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv  10:          bert.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:                      bert.attention.causal bool             = false
llama_model_loader: - kv  13:                          bert.pooling_type u32              = 2
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = t5
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,250002]  = ["<s>", "<pad>", "</s>", "<unk>", ","...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,250002]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,250002]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  20:            tokenizer.ggml.token_type_count u32              = 1
llama_model_loader: - kv  21:    tokenizer.ggml.remove_extra_whitespaces bool             = true
llama_model_loader: - kv  22:        tokenizer.ggml.precompiled_charsmap arr[u8,237539]   = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  25:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  26:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  28:                tokenizer.ggml.cls_token_id u32              = 0
llama_model_loader: - kv  29:               tokenizer.ggml.mask_token_id u32              = 250001
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  244 tensors
llama_model_loader: - type  f16:  145 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 1.07 GiB (16.25 BPW) 
init_tokenizer: initializing tokenizer for type 4
load: model vocab missing newline token, using special_pad_id instead
load: control token:      0 '<s>' is not marked as EOG
load: control token:      2 '</s>' is not marked as EOG
load: control token:      1 '<pad>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 4
time=2025-05-24T12:32:24.449+03:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
load: token to piece cache size = 2.1668 MB
print_info: arch             = bert
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 1024
print_info: n_layer          = 24
print_info: n_head           = 16
print_info: n_head_kv        = 16
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 1.0e-05
print_info: f_norm_rms_eps   = 0.0e+00
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 4096
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 0
print_info: pooling type     = 2
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 335M
print_info: model params     = 566.70 M
print_info: general.name     = n/a
print_info: vocab type       = UGM
print_info: n_vocab          = 250002
print_info: n_merges         = 0
print_info: BOS token        = 0 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 3 '<unk>'
print_info: SEP token        = 2 '</s>'
print_info: PAD token        = 1 '<pad>'
print_info: MASK token       = 250001 '[PAD250000]'
print_info: LF token         = 0 '<s>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device Metal, is_swa = 0
load_tensors: layer   1 assigned to device Metal, is_swa = 0
load_tensors: layer   2 assigned to device Metal, is_swa = 0
load_tensors: layer   3 assigned to device Metal, is_swa = 0
load_tensors: layer   4 assigned to device Metal, is_swa = 0
load_tensors: layer   5 assigned to device Metal, is_swa = 0
load_tensors: layer   6 assigned to device Metal, is_swa = 0
load_tensors: layer   7 assigned to device Metal, is_swa = 0
load_tensors: layer   8 assigned to device Metal, is_swa = 0
load_tensors: layer   9 assigned to device Metal, is_swa = 0
load_tensors: layer  10 assigned to device Metal, is_swa = 0
load_tensors: layer  11 assigned to device Metal, is_swa = 0
load_tensors: layer  12 assigned to device Metal, is_swa = 0
load_tensors: layer  13 assigned to device Metal, is_swa = 0
load_tensors: layer  14 assigned to device Metal, is_swa = 0
load_tensors: layer  15 assigned to device Metal, is_swa = 0
load_tensors: layer  16 assigned to device Metal, is_swa = 0
load_tensors: layer  17 assigned to device Metal, is_swa = 0
load_tensors: layer  18 assigned to device Metal, is_swa = 0
load_tensors: layer  19 assigned to device Metal, is_swa = 0
load_tensors: layer  20 assigned to device Metal, is_swa = 0
load_tensors: layer  21 assigned to device Metal, is_swa = 0
load_tensors: layer  22 assigned to device Metal, is_swa = 0
load_tensors: layer  23 assigned to device Metal, is_swa = 0
load_tensors: layer  24 assigned to device Metal, is_swa = 0
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   520.30 MiB
load_tensors: Metal_Mapped model buffer size =   577.23 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 0
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 17179.89 MB
ggml_metal_init: loaded kernel_add                                    0x14e208ae0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_add_row                                0x14e209290 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sub                                    0x14e209840 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sub_row                                0x14e209df0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul                                    0x14e20a3a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_row                                0x14e20a950 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_div                                    0x14e20af00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_div_row                                0x14e20b4b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_repeat_f32                             0x14e20ba60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_repeat_f16                             0x14e20bf60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_repeat_i32                             0x14e20c460 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_repeat_i16                             0x14e20c960 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_scale                                  0x14e20d480 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_scale_4                                0x14e20dc30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_clamp                                  0x14e20e440 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_tanh                                   0x14e20eb60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_relu                                   0x14e20f280 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sigmoid                                0x14e20f9a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_gelu                                   0x14e2100c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_gelu_4                                 0x14e210890 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_gelu_quick                             0x14e210fb0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_gelu_quick_4                           0x14e2116d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_silu                                   0x14e211df0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_silu_4                                 0x14e212690 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_elu                                    0x14e212db0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_soft_max_f16                           0x14e213250 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_soft_max_f16_4                         0x14e2136f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_soft_max_f32                           0x14e213d90 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_soft_max_f32_4                         0x14e214230 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_diag_mask_inf                          0x14e2146d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_diag_mask_inf_8                        0x14e214990 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_f32                           0x14e215080 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_f16                           0x14e215340 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: loaded kernel_get_rows_q4_0                          0x14e2157e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_1                          0x14e215c80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_0                          0x14e216120 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_1                          0x14e2165c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q8_0                          0x14e216a60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q2_K                          0x14e216f00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q3_K                          0x14e2173a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_K                          0x14e217840 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_K                          0x14e217ce0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q6_K                          0x14e218180 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_iq2_xxs                       0x14e218440 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_iq2_xs                        0x14e218950 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_iq3_xxs                       0x14e218e60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_iq3_s                         0x14e219550 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_iq2_s                         0x14e219d00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_iq1_s                         0x14e21a1a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_iq1_m                         0x14e21a640 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_iq4_nl                        0x14e21aae0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_iq4_xs                        0x14e21af80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_i32                           0x14e21b420 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rms_norm                               0x14e21b8c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_l2_norm                                0x14e21bd60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_group_norm                             0x14e21c200 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_norm                                   0x14e21c6a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_ssm_conv_f32                           0x14e21cb40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_ssm_scan_f32                           0x14e21d090 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rwkv_wkv6_f32                          0x14e21d530 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rwkv_wkv7_f32                          0x14e21d7f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f32_f32                         0x14e21dc90 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: loaded kernel_mul_mv_f16_f32                         0x14e21e130 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_1row                    0x14e21e5d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_l4                      0x14e21ea70 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f16_f16                         0x14e21ef10 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q4_0_f32                        0x14e21f3b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q4_1_f32                        0x14e21f850 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q5_0_f32                        0x14e21fcf0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q5_1_f32                        0x14e220190 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q8_0_f32                        0x14e220630 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_f16_f32_r1_2                0x14e220b80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_f16_f32_r1_3                0x14e2210d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_f16_f32_r1_4                0x14e221620 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_f16_f32_r1_5                0x14e221b70 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_0_f32_r1_2               0x14e2220c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_0_f32_r1_3               0x14e222610 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_0_f32_r1_4               0x14e222b60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_0_f32_r1_5               0x14e2230b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_1_f32_r1_2               0x14e223600 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_1_f32_r1_3               0x14e223b50 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_1_f32_r1_4               0x14e2240a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_1_f32_r1_5               0x14e2245f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_0_f32_r1_2               0x14e224b40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_0_f32_r1_3               0x14e225090 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_0_f32_r1_4               0x14e2255e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_0_f32_r1_5               0x14e225b30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_1_f32_r1_2               0x14e226080 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_1_f32_r1_3               0x14e2265d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_1_f32_r1_4               0x14e226b20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_1_f32_r1_5               0x14e227070 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q8_0_f32_r1_2               0x14e2275c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q8_0_f32_r1_3               0x14e227b10 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q8_0_f32_r1_4               0x14e228060 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q8_0_f32_r1_5               0x14e2285b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_K_f32_r1_2               0x14e228b00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_K_f32_r1_3               0x14e229050 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_K_f32_r1_4               0x14e219810 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q4_K_f32_r1_5               0x14e2294c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_K_f32_r1_2               0x14e229c70 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_K_f32_r1_3               0x14e22a1c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_K_f32_r1_4               0x14e22a710 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q5_K_f32_r1_5               0x14e22ac60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q6_K_f32_r1_2               0x14e22b1b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q6_K_f32_r1_3               0x14e22b700 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q6_K_f32_r1_4               0x14e22bc50 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_q6_K_f32_r1_5               0x14e22c1a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_iq4_nl_f32_r1_2             0x14e22c6f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_iq4_nl_f32_r1_3             0x14e22cc40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_iq4_nl_f32_r1_4             0x14e22d190 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_ext_iq4_nl_f32_r1_5             0x14e22d6e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q2_K_f32                        0x14e22db80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q3_K_f32                        0x14e22e020 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q4_K_f32                        0x14e22e4c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q5_K_f32                        0x14e22e960 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q6_K_f32                        0x14e22ee00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_iq2_xxs_f32                     0x14e22f2a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_iq2_xs_f32                      0x14e22f740 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_iq3_xxs_f32                     0x14e22fbe0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_iq3_s_f32                       0x14e230080 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_iq2_s_f32                       0x14e230520 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_iq1_s_f32                       0x14e2309c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_iq1_m_f32                       0x14e230e60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_iq4_nl_f32                      0x14e231300 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_iq4_xs_f32                      0x14e2317a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_f32_f32                      0x14e231c40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_f16_f32                      0x14e2320e0 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: loaded kernel_mul_mv_id_q4_0_f32                     0x14e232580 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_q4_1_f32                     0x14e232a20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_q5_0_f32                     0x14e232ec0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_q5_1_f32                     0x14e233360 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_q8_0_f32                     0x14e233800 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_q2_K_f32                     0x14e233ca0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_q3_K_f32                     0x14e234140 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_q4_K_f32                     0x14e2345e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_q5_K_f32                     0x14e234a80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_q6_K_f32                     0x14e234f20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_iq2_xxs_f32                  0x14e2353c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_iq2_xs_f32                   0x14e235860 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_iq3_xxs_f32                  0x14e235d00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_iq3_s_f32                    0x14e2361a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_iq2_s_f32                    0x14e236640 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_iq1_s_f32                    0x14e236ae0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_iq1_m_f32                    0x14e236f80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_iq4_nl_f32                   0x14e237420 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_id_iq4_xs_f32                   0x14e2378c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_f32_f32                         0x14e237d60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_f16_f32                         0x14e238200 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32                        0x14e2386a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32                        0x14e238b40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q5_0_f32                        0x14e238fe0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q5_1_f32                        0x14e239480 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32                        0x14e239920 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32                        0x14e239dc0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32                        0x14e23a260 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32                        0x14e23a700 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32                        0x14e23aba0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32                        0x14e23b040 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_iq2_xxs_f32                     0x14e23b4e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_iq2_xs_f32                      0x14e23b980 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_iq3_xxs_f32                     0x14e23be20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_iq3_s_f32                       0x14e23c2c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_iq2_s_f32                       0x14e23c760 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_iq1_s_f32                       0x14e23cc00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_iq1_m_f32                       0x14e23d0a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_iq4_nl_f32                      0x14e23d540 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_iq4_xs_f32                      0x14e23d9e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_map0_f16                     0x14e23de80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_map1_f32                     0x14e23e320 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_f32_f16                      0x14e23e7c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_f16_f16                      0x14e23ec60 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: loaded kernel_mul_mm_id_q4_0_f16                     0x14e23f100 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_q4_1_f16                     0x14e23f5a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_q5_0_f16                     0x14e23fa40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_q5_1_f16                     0x14e23fee0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_q8_0_f16                     0x14e240380 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_q2_K_f16                     0x14e240820 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_q3_K_f16                     0x14e240cc0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_q4_K_f16                     0x14e241160 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_q5_K_f16                     0x14e241600 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_q6_K_f16                     0x14e241aa0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_iq2_xxs_f16                  0x14e241f40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_iq2_xs_f16                   0x14e2423e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_iq3_xxs_f16                  0x14e242880 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_iq3_s_f16                    0x14e242d20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_iq2_s_f16                    0x14e2431c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_iq1_s_f16                    0x14e243660 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_iq1_m_f16                    0x14e243b00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_iq4_nl_f16                   0x14e243fa0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_id_iq4_xs_f16                   0x14e244440 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rope_norm_f32                          0x14e244990 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rope_norm_f16                          0x14e244ee0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rope_multi_f32                         0x14e245430 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rope_multi_f16                         0x14e245980 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rope_vision_f32                        0x14e245ed0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rope_vision_f16                        0x14e246420 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rope_neox_f32                          0x14e246970 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rope_neox_f16                          0x14e246ec0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_im2col_f16                             0x14e247360 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_im2col_f32                             0x14e247800 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_im2col_ext_f16                         0x14e247ca0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_im2col_ext_f32                         0x14e248140 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_conv_transpose_1d_f32_f32              0x14e2485e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_conv_transpose_1d_f16_f32              0x14e248a80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_upscale_f32                            0x14e248fd0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_pad_f32                                0x14e249470 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_pad_reflect_1d_f32                     0x14e249910 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_timestep_embedding_f32                 0x14e249db0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_arange_f32                             0x14e24a250 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_argsort_f32_i32_asc                    0x14e24a6f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_argsort_f32_i32_desc                   0x14e24ab90 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_leaky_relu_f32                         0x14e24b3e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_f16_h64                 0x14e24b930 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_f16_h80                 0x14e24be80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_f16_h96                 0x14e24c3d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_f16_h112                0x14e24c920 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_f16_h128                0x14e24ce70 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_f16_h192                0x14e24d3c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_f16_hk192_hv128         0x14e24d910 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_f16_h256                0x14e24de60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_f16_hk576_hv512         0x14e24e3b0 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h64                0x14e24e900 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h80                0x14e24ee50 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h96                0x14e24f3a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h112               0x14e24f8f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h128               0x14e24fe40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h192               0x14e250390 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_hk192_hv128        0x14e250650 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h256               0x14e250eb0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_hk576_hv512        0x14e251170 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h64                0x14e2519d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h80                0x14e251f20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h96                0x14e252470 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h112               0x14e2529c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h128               0x14e252f10 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h192               0x14e253460 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_hk192_hv128        0x14e253720 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h256               0x14e253f80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_hk576_hv512        0x14e254240 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h64                0x14e254aa0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h80                0x14e254ff0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h96                0x14e255540 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h112               0x14e255a90 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h128               0x14e255fe0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h192               0x14e256530 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_hk192_hv128        0x14e2567f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h256               0x14e257050 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_hk576_hv512        0x14e257310 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h64                0x14e257b70 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h80                0x14e2580c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h96                0x14e258610 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h112               0x14e258b60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h128               0x14e2590b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h192               0x14e259600 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_hk192_hv128        0x14e2598c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h256               0x14e25a120 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_hk576_hv512        0x14e25a3e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h64                0x14e25ac40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h80                0x14e25b190 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h96                0x14e25b6e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h112               0x14e25bc30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h128               0x14e25c180 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h192               0x14e25c6d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_hk192_hv128        0x14e25c990 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h256               0x14e25d1f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_hk576_hv512        0x14e25d4b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_f16_h96             0x14e25dd10 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_0_h96            0x14e25e260 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_1_h96            0x14e25e7b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_0_h96            0x14e25ed00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_1_h96            0x14e25f250 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_h96            0x14e25f7a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_f16_h128            0x14e25fcf0 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_0_h128           0x14e260240 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_1_h128           0x14e260790 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_0_h128           0x14e260ce0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_1_h128           0x14e261230 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_h128           0x14e261780 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_f16_h192            0x14e261cd0 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_0_h192           0x14e262220 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_1_h192           0x14e262770 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_0_h192           0x14e262cc0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_1_h192           0x14e263210 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_h192           0x14e263760 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_f16_hk192_hv128      0x14e263a20 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_0_hk192_hv128      0x14e263ff0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_1_hk192_hv128      0x14e2645c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_0_hk192_hv128      0x14e264b90 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_1_hk192_hv128      0x14e265160 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_hk192_hv128      0x14e265730 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_f16_h256            0x14e265f90 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_0_h256           0x14e2664e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_1_h256           0x14e266a30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_0_h256           0x14e266f80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_1_h256           0x14e2674d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_h256           0x14e267a20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_f16_hk576_hv512      0x14e267ce0 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_0_hk576_hv512      0x14e2682b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_1_hk576_hv512      0x14e268880 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_0_hk576_hv512      0x14e268e50 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_1_hk576_hv512      0x14e269420 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_hk576_hv512      0x14e2699f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_set_f32                                0x14e26a1a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_set_i32                                0x14e26a640 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_f32                            0x14e26aae0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_f16                            0x14e26af80 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: loaded kernel_cpy_f16_f32                            0x14e26b420 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f16_f16                            0x14e26b8c0 | th_max = 1024 | th_width =   32
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
ggml_metal_init: loaded kernel_cpy_f32_q8_0                           0x14e26bd60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_q4_0                           0x14e26c200 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_q4_1                           0x14e26c6a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_q5_0                           0x14e26cb40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_q5_1                           0x14e26cfe0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_iq4_nl                         0x14e26d480 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_q4_0_f32                           0x14e26d920 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_q4_0_f16                           0x14e26ddc0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_q4_1_f32                           0x14e26e260 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_q4_1_f16                           0x14e26e700 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_q5_0_f32                           0x14e26eba0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_q5_0_f16                           0x14e26f040 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_q5_1_f32                           0x14e26f4e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_q5_1_f16                           0x14e26f980 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_q8_0_f32                           0x14e26fe20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_q8_0_f16                           0x14e2702c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_concat                                 0x14e270810 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sqr                                    0x14e270f30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sqrt                                   0x14e271650 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sin                                    0x14e271d70 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cos                                    0x14e272490 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_neg                                    0x14e272bb0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sum_rows                               0x14e273100 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_argmax                                 0x14e2735a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_pool_2d_avg_f32                        0x14e273a40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_pool_2d_max_f32                        0x14e273ee0 | th_max = 1024 | th_width =   32
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 65536
time=2025-05-24T12:32:24.951+03:00 level=INFO source=server.go:630 msg="llama runner started in 0.76 seconds"
time=2025-05-24T12:32:24.951+03:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/bge-m3:567m runner.inference=metal runner.devices=1 runner.size="1.7 GiB" runner.vram="1.7 GiB" runner.parallel=1 runner.pid=82229 runner.model=/Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c runner.num_ctx=8192
time=2025-05-24T12:32:24.957+03:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=3807 used=0 remaining=3807
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
[GIN] 2025/05/24 - 12:32:25 | 200 |  1.637294875s |       127.0.0.1 | POST     "/api/embeddings"
time=2025-05-24T12:32:25.562+03:00 level=DEBUG source=sched.go:503 msg="context for request finished"
time=2025-05-24T12:32:25.562+03:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/bge-m3:567m runner.inference=metal runner.devices=1 runner.size="1.7 GiB" runner.vram="1.7 GiB" runner.parallel=1 runner.pid=82229 runner.model=/Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c runner.num_ctx=8192 duration=5m0s
time=2025-05-24T12:32:25.562+03:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/bge-m3:567m runner.inference=metal runner.devices=1 runner.size="1.7 GiB" runner.vram="1.7 GiB" runner.parallel=1 runner.pid=82229 runner.model=/Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c runner.num_ctx=8192 refCount=0
time=2025-05-24T12:32:25.606+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32
time=2025-05-24T12:32:25.606+03:00 level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=/Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c
time=2025-05-24T12:32:25.610+03:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=3807 prompt=3392 used=0 remaining=3392
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
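For context on the message itself: `llama_encode()` and `llama_decode()` are the two llama.cpp entry points, and this line is llama.cpp declining to run the decode path on a context built for a non-causal embedding model (the load log above shows `arch = bert`, `causal attn = 0` for bge-m3) and pointing the caller at the encode path instead. Note that the requests in the log still complete with a 200 after about 1.6s, which suggests embeddings are still being produced. A minimal sanity check, assuming a default local install on port 11434 and the `bge-m3:567m` tag from the log:

```python
# Illustrative check, not from the thread: confirm /api/embed still
# returns a vector for bge-m3 despite the repeated decode warnings.
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/embed",  # default ollama address (assumption)
    json={"model": "bge-m3:567m", "input": "hello world"},
    timeout=60,
)
resp.raise_for_status()
data = resp.json()

# /api/embed returns {"embeddings": [[...], ...]}; the older
# /api/embeddings endpoint returns a single {"embedding": [...]} instead.
vec = data["embeddings"][0]
print(len(vec))  # bge-m3 reports n_embd = 1024 in the load log
```

If that prints 1024, embeddings are coming back fine and the repeated log line is noise rather than a hard failure of the request.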
0x14e263210 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_h192 0x14e263760 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_f16_hk192_hv128 0x14e263a20 | th_max = 1024 | th_width = 32 ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported) ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_0_hk192_hv128 0x14e263ff0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_1_hk192_hv128 0x14e2645c0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_0_hk192_hv128 0x14e264b90 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_1_hk192_hv128 0x14e265160 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_hk192_hv128 0x14e265730 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_f16_h256 0x14e265f90 | th_max = 1024 | th_width = 32 ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported) ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_0_h256 0x14e2664e0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_1_h256 0x14e266a30 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_0_h256 0x14e266f80 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_1_h256 0x14e2674d0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_h256 0x14e267a20 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_f16_hk576_hv512 0x14e267ce0 | th_max = 1024 | th_width = 32 ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported) ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_0_hk576_hv512 0x14e2682b0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_1_hk576_hv512 0x14e268880 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_0_hk576_hv512 0x14e268e50 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_1_hk576_hv512 0x14e269420 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_hk576_hv512 0x14e2699f0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_set_f32 0x14e26a1a0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_set_i32 0x14e26a640 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f32_f32 0x14e26aae0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f32_f16 0x14e26af80 | th_max = 1024 | th_width = 32 ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported) ggml_metal_init: loaded kernel_cpy_f16_f32 0x14e26b420 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f16_f16 0x14e26b8c0 | th_max = 1024 | th_width = 32 ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported) ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported) ggml_metal_init: loaded kernel_cpy_f32_q8_0 0x14e26bd60 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f32_q4_0 0x14e26c200 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f32_q4_1 0x14e26c6a0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f32_q5_0 0x14e26cb40 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f32_q5_1 0x14e26cfe0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f32_iq4_nl 0x14e26d480 | 
th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_q4_0_f32 0x14e26d920 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_q4_0_f16 0x14e26ddc0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_q4_1_f32 0x14e26e260 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_q4_1_f16 0x14e26e700 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_q5_0_f32 0x14e26eba0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_q5_0_f16 0x14e26f040 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_q5_1_f32 0x14e26f4e0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_q5_1_f16 0x14e26f980 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_q8_0_f32 0x14e26fe20 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_q8_0_f16 0x14e2702c0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_concat 0x14e270810 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_sqr 0x14e270f30 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_sqrt 0x14e271650 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_sin 0x14e271d70 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cos 0x14e272490 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_neg 0x14e272bb0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_sum_rows 0x14e273100 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_argmax 0x14e2735a0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_pool_2d_avg_f32 0x14e273a40 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_pool_2d_max_f32 0x14e273ee0 | th_max = 1024 | th_width = 32 set_abort_callback: call llama_context: CPU output buffer size = 0.00 MiB llama_context: enumerating backends llama_context: backend_ptrs.size() = 3 llama_context: max_nodes = 65536 time=2025-05-24T12:32:24.951+03:00 level=INFO source=server.go:630 msg="llama runner started in 0.76 seconds" time=2025-05-24T12:32:24.951+03:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/bge-m3:567m runner.inference=metal runner.devices=1 runner.size="1.7 GiB" runner.vram="1.7 GiB" runner.parallel=1 runner.pid=82229 runner.model=/Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c runner.num_ctx=8192 time=2025-05-24T12:32:24.957+03:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=3807 used=0 remaining=3807 decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) [GIN] 2025/05/24 - 12:32:25 | 200 | 1.637294875s | 127.0.0.1 | POST "/api/embeddings" time=2025-05-24T12:32:25.562+03:00 level=DEBUG source=sched.go:503 msg="context for request finished" time=2025-05-24T12:32:25.562+03:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" 
runner.name=registry.ollama.ai/library/bge-m3:567m runner.inference=metal runner.devices=1 runner.size="1.7 GiB" runner.vram="1.7 GiB" runner.parallel=1 runner.pid=82229 runner.model=/Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c runner.num_ctx=8192 duration=5m0s time=2025-05-24T12:32:25.562+03:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/bge-m3:567m runner.inference=metal runner.devices=1 runner.size="1.7 GiB" runner.vram="1.7 GiB" runner.parallel=1 runner.pid=82229 runner.model=/Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c runner.num_ctx=8192 refCount=0 time=2025-05-24T12:32:25.606+03:00 level=DEBUG source=ggml.go:155 msg="key not found" key=general.alignment default=32 time=2025-05-24T12:32:25.606+03:00 level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=/Users/hillar/.ollama/models/blobs/sha256-daec91ffb5dd0c27411bd71f29932917c49cf529a641d0168496c3a501e3062c time=2025-05-24T12:32:25.610+03:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=3807 prompt=3392 used=0 remaining=3392 decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) decode: cannot decode batches with this context (use llama_encode() instead) ```
Author
Owner

@jfbloom22 commented on GitHub (May 24, 2025):

Running into the same issue while running "Reindex Knowledge Base Vectors" in Open WebUI. Running on Linode, CPU only. Tried `snowflake-arctic-embed2:568m` and `jeffh/intfloat-multilingual-e5-large-instruct:f16`.

Looks like you are running an M4 MacBook Pro. That rules out my suspicion that I needed a GPU.

Author
Owner

@majnas commented on GitHub (May 25, 2025):

Same with `nomic-embed-text`:

```bash
llmengine_dev_gpu_0a  | decode: cannot decode batches with this context (use llama_encode() instead)
llmengine_dev_gpu_0a  | decode: cannot decode batches with this context (use llama_encode() instead)
llmengine_dev_gpu_0a  | decode: cannot decode batches with this context (use llama_encode() instead)
llmengine_dev_gpu_0a  | decode: cannot decode batches with this context (use llama_encode() instead)
llmengine_dev_gpu_0a  | decode: cannot decode batches with this context (use llama_encode() instead)
```
Author
Owner

@f0rGoT-Ten commented on GitHub (May 25, 2025):

I am using Ollama in Lightning AI. When I use `nomic-embed-text` as the embedding model in my project, this error pops up in the terminal.

Is anybody able to fix this issue?

Author
Owner

@Pedrofran682 commented on GitHub (May 25, 2025):

Same problem here. But when I try to execute on Colab I don't get this problem and my code runs as expected.

Author
Owner

@YetheSamartaka commented on GitHub (May 26, 2025):

Strangely, mxbai-embed-large:latest is working fine but bge-m3:latest is not, and I have exactly the same issue. Right now, I need to revert.

Author
Owner

@jasonsi1993 commented on GitHub (May 26, 2025):

> Strangely, mxbai-embed-large:latest is working fine but bge-m3:latest is not, and I have exactly the same issue. Right now, I need to revert.

Hi, are you having trouble with mxbai-embed understanding the content of your doc? Is it better than bge-m3?

Author
Owner

@ezhil56x commented on GitHub (May 26, 2025):

I'm facing the same problem, using `nomic-embed-text` in an Ollama Docker instance.

Author
Owner

@jfbloom22 commented on GitHub (May 26, 2025):

Wow, I just noticed Ollama has 1.6k open issues. Sheesh. Someone needs to help them. And by someone I mean an AI assistant: https://github.com/apps/dosubot

Author
Owner

@pomazanbohdan commented on GitHub (May 27, 2025):

Same problem.
Win11, Docker Desktop WSL2, ollama:latest, mxbai-embed-large

Author
Owner

@ProjectMoon commented on GitHub (May 27, 2025):

I am experiencing this with OpenWebUI using snowflake-arctic-embed2. It started showing up after 0.7.1, perhaps?

Author
Owner

@hedrickbt commented on GitHub (May 27, 2025):

Seeing the issue with Ollama 0.7.0.
Using open-webui 0.6.11 (same issue in 0.6.10) pointed at the Ollama instance.
Admin Panel | Documents | Embedding Model: nomic-embed-text:latest

Author
Owner

@rick-github commented on GitHub (May 27, 2025):

ollama 0.7.0 pulled in a new version of llama.cpp (https://github.com/ollama/ollama/commit/0cefd46f2). This included a change in llama.cpp that switched processing depending on the initialization state (https://github.com/ggml-org/llama.cpp/commit/6562e5a4d6c58326dcd79002ea396d4141f1b18e). A later patch (https://github.com/ggml-org/llama.cpp/commit/79c137f77), not yet included in ollama, changes this from WARN to DEBUG, so it won't show up in the logs.

Is this actually affecting the output? A simple check seems to show that all of the embedding models mentioned above return the same output in 0.6.8 and 0.7.1:

```console
$ for i in 0.6.8 0.7.1 ; do
    OLLAMA_DEBUG=2 OLLAMA_DOCKER_TAG=$i OLLAMA_KEEP_ALIVE=-1 OLLAMA_NUM_PARALLEL=1 docker compose up -d ollama >&- 2>&-
    sleep 5
    curl -s localhost:11434/api/version | jq
    for m in nomic-embed-text mxbai-embed-large bge-m3 snowflake-arctic-embed2 ; do
      printf "%-25s" $m
      e=$(curl -s http://localhost:11434/api/embed -d '{"model": "'$m'","input": "Llamas are members of the camelid family"}' )
      echo -n "$(jq -nc "$e|.embeddings"|md5sum) "
      jq -nc "$e"'|.embeddings[]|.[0:3] + ["..."] + .[-3:]'
    done
  done
{
  "version": "0.6.8"
}
nomic-embed-text         21067846f30731b854594fa703ea5bf6  - [0.014560218,0.014495488,-0.1585055,"...",-0.036605217,-0.08269139,-0.06278599]
mxbai-embed-large        6466a34251f804d72111b82fe7f57e7f  - [0.032664254,0.06619621,0.03598059,"...",0.021244442,0.02600308,0.03980255]
bge-m3                   5f5b3577a9994ac65e3c319efc2597c9  - [-0.018193955,0.006162767,-0.048996422,"...",-0.036059953,-0.016258406,0.0039238334]
snowflake-arctic-embed2  11a34df194f6160778aebaf8c5d89de0  - [-0.038091954,0.0315819,-0.005225725,"...",-0.0039103366,0.01802737,-0.0016246102]
{
  "version": "0.7.1"
}
nomic-embed-text         21067846f30731b854594fa703ea5bf6  - [0.014560218,0.014495488,-0.1585055,"...",-0.036605217,-0.08269139,-0.06278599]
mxbai-embed-large        6466a34251f804d72111b82fe7f57e7f  - [0.032664254,0.06619621,0.03598059,"...",0.021244442,0.02600308,0.03980255]
bge-m3                   5f5b3577a9994ac65e3c319efc2597c9  - [-0.018193955,0.006162767,-0.048996422,"...",-0.036059953,-0.016258406,0.0039238334]
snowflake-arctic-embed2  11a34df194f6160778aebaf8c5d89de0  - [-0.038091954,0.0315819,-0.005225725,"...",-0.0039103366,0.01802737,-0.0016246102]
```
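For anyone who wants to repeat this check programmatically rather than by md5 hash, here is a minimal sketch that compares the vectors themselves. It assumes two Ollama instances are reachable on the given ports (the second port mapping is hypothetical; adjust to your setup):

```python
import math
import requests

PROMPT = "Llamas are members of the camelid family"

def embed(host: str, model: str) -> list[float]:
    """Fetch one embedding vector from an Ollama /api/embed endpoint."""
    r = requests.post(f"{host}/api/embed", json={"model": model, "input": PROMPT})
    r.raise_for_status()
    return r.json()["embeddings"][0]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical setup: 0.6.8 listening on 11434, 0.7.1 on 11435.
for m in ["nomic-embed-text", "mxbai-embed-large", "bge-m3", "snowflake-arctic-embed2"]:
    a = embed("http://localhost:11434", m)
    b = embed("http://localhost:11435", m)
    print(f"{m:25s} cosine={cosine(a, b):.6f}")  # 1.000000 if the outputs are identical
```
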
Author
Owner

@joestump commented on GitHub (May 30, 2025):

I have been running into this issue for at least a couple of weeks now. Switching from `nomic-embed-text:latest` to `mxbai-embed-large:latest` resolved it for me. Running `latest`.

Author
Owner

@diramazioni commented on GitHub (Jun 4, 2025):

Why was this closed?

Author
Owner

@rick-github commented on GitHub (Jun 4, 2025):

Because there was no evidence of an actual problem. Feel free to add some.

Author
Owner

@chxb commented on GitHub (Jun 10, 2025):

Same problem in Dify 1.4.0, Ollama 0.9.0:
`llama_context: n_ctx_per_seq (4096) > n_ctx_train (512) -- possible training context overflow`

Changing the document max segment length from 1024 to 512 makes it work normally.
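One way to check for this kind of mismatch before picking a chunk size is to ask Ollama for the model's training context length. A minimal sketch, assuming a local server on the default port; the exact `model_info` key is architecture-prefixed (e.g. `bert.context_length` for BERT-style embedders), so the suffix match below is an assumption:

```python
import requests

def training_context_length(model: str, host: str = "http://localhost:11434") -> int | None:
    """Return the model's training context length as reported by /api/show, if present."""
    info = requests.post(f"{host}/api/show", json={"model": model}).json()
    # Keys in model_info are prefixed with the architecture name,
    # so look for any key ending in ".context_length".
    for key, value in info.get("model_info", {}).items():
        if key.endswith(".context_length"):
            return value
    return None

if __name__ == "__main__":
    # Keep document segments (plus any prompt overhead) under this value.
    print(training_context_length("bge-m3"))
```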

Author
Owner

@juanfran-vsystem commented on GitHub (Jun 10, 2025):

I'm trying v0.9.1-rc0 and still having the same issue. I'm using open-webui and ollama in Docker on Windows. I've tried using Apache Tika in open-webui, also using nomic-embed-text and mxbai-embed-large, and changing the segment size to 1000, 500, 400... same result.

ollama log fragment:

```
decode: cannot decode batches with this context (use llama_encode() instead)
[GIN] 2025/06/10 - 09:04:23 | 200 |  141.747087ms |      172.18.0.7 | POST     "/api/embed"
decode: cannot decode batches with this context (use llama_encode() instead)
[GIN] 2025/06/10 - 09:04:23 | 200 |  148.259835ms |      172.18.0.7 | POST     "/api/embed"
decode: cannot decode batches with this context (use llama_encode() instead)
[GIN] 2025/06/10 - 09:04:23 | 200 |  159.877858ms |      172.18.0.7 | POST     "/api/embed"
decode: cannot decode batches with this context (use llama_encode() instead)
[GIN] 2025/06/10 - 09:04:23 | 200 |   143.49488ms |      172.18.0.7 | POST     "/api/embed"
```
Author
Owner

@rick-github commented on GitHub (Jun 10, 2025):

Is it actually affecting the output?

<!-- gh-comment-id:2958353464 --> @rick-github commented on GitHub (Jun 10, 2025): Is it actually affecting the output?
Author
Owner

@juanfran-vsystem commented on GitHub (Jun 10, 2025):

I've tried again, adding a 600 KB HTML document to the open-webui knowledge base. The previous warning in the ollama log gets repeated while reading the file, and finally open-webui shows a red notification saying: "400: Embedding dimension 768 does not match collection dimensionality 1024". In other attempts I stopped ollama thinking it was an endless loop, so I hadn't seen that notification.

Author
Owner

@rick-github commented on GitHub (Jun 10, 2025):

"400: Embedding dimension 768 does not match collection dimensionality 1024"

Sounds like a mismatch between your vector store and choice of embedding model. What model were you using, and how is your vector store configured?
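For anyone hitting this dimension error: the vector store collection must be created with the same dimensionality the embedding model outputs (768 for nomic-embed-text, 1024 for mxbai-embed-large or bge-m3). A minimal sketch using the qdrant-client library, assuming a local Qdrant instance and a hypothetical collection name:

```python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

OLLAMA = "http://localhost:11434"
MODEL = "nomic-embed-text"

# Embed a probe string to discover the model's output dimensionality.
resp = requests.post(f"{OLLAMA}/api/embed",
                     json={"model": MODEL, "input": "dimension probe"}).json()
dim = len(resp["embeddings"][0])  # 768 for nomic-embed-text

# Create (or recreate) the collection with a matching vector size.
client = QdrantClient(url="http://localhost:6333")
client.recreate_collection(
    collection_name="openwebui-docs",  # hypothetical name
    vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
)
```

Note that switching embedding models later means re-creating the collection (and re-indexing) with the new dimensionality.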

Author
Owner

@juanfran-vsystem commented on GitHub (Jun 10, 2025):

I think so. I've recreated the collection in open-webui so its dimension matches the model, and now I don't get that error in open-webui, but the warnings in ollama's log are the same. The issue now is that gemma3:12b doesn't find information in the file when asked. The file contains a table of codes and prices, and gemma3 can't find the first code when asked. Sorry, I'm a bit of a newbie with AI things.

Author
Owner

@rick-github commented on GitHub (Jun 10, 2025):

> The issue now is that gemma3:12b doesn't find information in the file when asked.

This seems like an open-webui issue; their issue tracker is [here](https://github.com/open-webui/open-webui/issues).

Author
Owner

@tjwebb commented on GitHub (Jun 17, 2025):

Also seeing this with the granite-embedding:278m model, but I am not using open-webui; I am calling the ollama API directly.

Author
Owner

@rick-github commented on GitHub (Jun 17, 2025):

Which issue? `decode: cannot decode batches` or `Embedding dimension 768 does not match`? If the former, it seems like it's [not an issue](https://github.com/ollama/ollama/issues/10811#issuecomment-2913170542); post details if you think it is. If it's the latter, post examples of direct API calls that exhibit this behaviour.

Author
Owner

@tjwebb commented on GitHub (Jun 17, 2025):

No, the original issue: `decode: cannot decode batches with this context (use llama_encode() instead)`

I can't use any embedding model on the latest ollama because of this error.

It goes away if I downgrade to 0.6.1.

Author
Owner

@rick-github commented on GitHub (Jun 17, 2025):

https://github.com/ollama/ollama/issues/10811#issuecomment-2913170542

Author
Owner

@tjwebb commented on GitHub (Jun 17, 2025):

It's not just a warning for me, though; ollama 0.7.0+ repeatedly crashes with 500 errors. So I'm stuck on 0.6.1 for now.

Author
Owner

@ibbobud commented on GitHub (Jun 17, 2025):

I am having the same issue using [all-minilm](https://ollama.com/library/all-minilm).

Author
Owner

@rick-github commented on GitHub (Jun 17, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

Author
Owner

@Strat00s commented on GitHub (Jun 21, 2025):

Why is this closed when it's still a thing?

Author
Owner

@rick-github commented on GitHub (Jun 21, 2025):

https://github.com/ollama/ollama/issues/10811#issuecomment-2940529429

Author
Owner

@teamolhuang commented on GitHub (Jun 24, 2025):

So, in human-understandable terms, the message in the title is just a warning that's expected to happen: llama.cpp simply logs it and continues with its work.

I am using the multilingual-e5-large model, which supports up to 512 tokens. I used a GPT-4 token counter to produce exactly 512 tokens, and it keeps erroring out. Ollama returns 500, but there is no visible reason in server.log except the `cannot decode` warning.

I then reduced the chunks to 256 tokens, kept every other configuration the same, and everything went fine without any error. Since people say this was working before 0.7, perhaps somewhere since 0.7 Ollama no longer truncates tokens, resulting in too many tokens for the embedding model and thus the error?
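A workaround consistent with this observation is to keep each chunk safely under the model's token limit before sending it to /api/embed. A minimal sketch, using whitespace word counts as a rough proxy for the model's real tokenizer (an intentional simplification: the actual token count will differ, so the margin is deliberately conservative); the model name is the one from the comment above and is assumed to be pulled already:

```python
import requests

OLLAMA = "http://localhost:11434"
MODEL = "multilingual-e5-large"  # assumed already pulled
MAX_WORDS = 200  # conservative proxy for a 512-token limit

def chunk_words(text: str, max_words: int = MAX_WORDS):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i + max_words])

def embed_document(text: str) -> list[list[float]]:
    """Embed each chunk separately so no single input exceeds the model's context."""
    chunks = list(chunk_words(text))
    resp = requests.post(f"{OLLAMA}/api/embed",
                         json={"model": MODEL, "input": chunks})
    resp.raise_for_status()
    return resp.json()["embeddings"]  # one vector per chunk
```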

Author
Owner

@rick-github commented on GitHub (Jun 24, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging. If there's no visible reason in the log, try increasing verbosity by setting `OLLAMA_DEBUG=2` in the server environment.

If it seems sensitive to the length of the input, it might be https://github.com/ollama/ollama/issues/7288#issuecomment-2591709109.

Author
Owner

@MSVstudios commented on GitHub (Jul 26, 2025):

Most user issues arise from a lack of understanding of what context size means and how tools use it. If the embedding model has a context input limit of 512 tokens and you or the tool you're using provides more tokens, Ollama will return an error or a warning.
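The /api/embed endpoint exposes this behaviour directly: its `truncate` parameter (default true) trims each input to the context length, while `truncate: false` makes the server return an error instead of silently truncating. A minimal sketch showing both behaviours, assuming `nomic-embed-text` is pulled locally:

```python
import requests

OLLAMA = "http://localhost:11434"
long_text = "word " * 5000  # far more tokens than the model's context window

# Default behaviour: the input is truncated to the context length and embedded.
ok = requests.post(f"{OLLAMA}/api/embed",
                   json={"model": "nomic-embed-text", "input": long_text})
print(ok.status_code, len(ok.json()["embeddings"][0]))  # 200 and the vector length

# With truncate disabled, the server refuses inputs that exceed the context.
err = requests.post(f"{OLLAMA}/api/embed",
                    json={"model": "nomic-embed-text",
                          "input": long_text,
                          "truncate": False})
print(err.status_code, err.json().get("error"))  # expect an error response
```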

Author
Owner

@eternal-bug commented on GitHub (Aug 16, 2025):

I am using the `bge-m3:567m` embedding model and encounter the same warning:
`decode: cannot decode batches with this context (use llama_encode() instead)`.
However, it turned out that the embedding vectors were usable.

## Overly long context isn't the reason

Even if I embed only "hello", the warning still appears.

## Searching the code

I searched for this message in the source code of ollama (v0.11.4) and found it at [`./llama/llama.cpp/src/llama-context.cpp:849`](https://github.com/ollama/ollama/blob/v0.11.4/llama/llama.cpp/src/llama-context.cpp#L849). The code is:

```cpp
int llama_context::decode(llama_batch & inp_batch) {
    if (!memory) {
        LLAMA_LOG_WARN("%s: cannot decode batches with this context (use llama_encode() instead)\n", __func__);
        return encode(inp_batch);
    }
```

Recently, that is, on 2025-08-16, I noticed that this code has changed at [`./llama/llama.cpp/src/llama-context.cpp:950`](https://github.com/ollama/ollama/blob/v0.11.5-rc2/llama/llama.cpp/src/llama-context.cpp#L950):

```cpp
int llama_context::decode(const llama_batch & batch_inp) {
    GGML_ASSERT((!batch_inp.token && batch_inp.embd) || (batch_inp.token && !batch_inp.embd)); // NOLINT

    if (!memory) {
        LLAMA_LOG_DEBUG("%s: cannot decode batches with this context (calling encode() instead)\n", __func__);
        return encode(batch_inp);
    }
```

The log level changed from **`LLAMA_LOG_WARN`** to **`LLAMA_LOG_DEBUG`**.

## A new version may solve the problem

So I guess that in versions after v0.11.4 the severity of this message has been lowered to DEBUG, and it should no longer be displayed on the command line. So just wait for a new version!

The reason may be that the embedding model does not require a KV cache. In this case, llama_context does not allocate memory, so `memory == nullptr`. Therefore, the decode() function cannot proceed and falls back to the encode() function. But this is actually not an error, because encode() is precisely the operation that embedding requires.
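A quick way to confirm the warning is harmless is to embed a trivial input and check that a sane vector comes back. A minimal sketch, assuming `bge-m3` is pulled locally (bge-m3 outputs 1024-dimensional vectors):

```python
import math
import requests

OLLAMA = "http://localhost:11434"

resp = requests.post(f"{OLLAMA}/api/embed",
                     json={"model": "bge-m3", "input": "hello"})
resp.raise_for_status()
vec = resp.json()["embeddings"][0]

# The server may still print the decode warning in its own log,
# but the returned embedding is a normal, usable vector.
norm = math.sqrt(sum(x * x for x in vec))
print(len(vec), round(norm, 4))  # expect 1024, with a norm near 1.0 if outputs are normalized
```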

Reference: github-starred/ollama#69159