[GH-ISSUE #11017] decode: cannot decode batches with this context (use llama_encode() instead) #7264

Closed
opened 2026-04-12 19:18:46 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @oatmealm on GitHub (Jun 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11017

What is the issue?

Trying different embedding models and batch-size combinations, I always see this error. I'm not sure how significant it is beyond a possible performance impact. The attached log is for nomic-embed-text:v1.5 with batch size 1.
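For context, this message is emitted by llama.cpp rather than by Ollama's own code: nomic-bert is a non-causal, encoder-only architecture (note `causal attn = 0` in the log below), so its context is created without a KV cache, and batches have to go through `llama_encode()` rather than `llama_decode()`. In recent llama.cpp, calling `llama_decode()` on such a context logs this warning and then falls back to encoding internally, which matches the log: the `/api/embed` request still returns 200 with valid embeddings, so the lines appear to be noise rather than a failure. Below is a minimal C sketch against the llama.cpp C API, not Ollama's actual runner code; the helper name `embed_tokens` is illustrative.

```c
// Minimal sketch against the llama.cpp C API (not Ollama's runner code).
// A non-causal embedding context must be driven with llama_encode();
// llama_decode() prints the warning seen in the log and, in recent
// llama.cpp, falls back to encoding internally.
#include <stddef.h>
#include "llama.h"

// Illustrative helper: embed one tokenized input and return the pooled
// embedding vector (n_embd floats) for sequence 0.
static const float * embed_tokens(struct llama_context * ctx,
                                  llama_token * tokens, int32_t n_tokens) {
    // Single-sequence batch over the whole input.
    struct llama_batch batch = llama_batch_get_one(tokens, n_tokens);

    // Correct entry point for this kind of context. Calling
    // llama_decode(ctx, batch) here instead would log:
    //   decode: cannot decode batches with this context (use llama_encode() instead)
    if (llama_encode(ctx, batch) != 0) {
        return NULL;
    }

    // pooling_type = 1 in the log is mean pooling, so a per-sequence
    // pooled embedding is available after encoding.
    return llama_get_embeddings_seq(ctx, 0);
}
```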

Relevant log output

time=2025-06-08T10:55:45.994+02:00 level=INFO source=sched.go:548 msg="updated VRAM based on existing loaded models" gpu=0 library=metal total="10.7 GiB" available="9.0 GiB"
time=2025-06-08T10:55:45.995+02:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/user/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 gpu=0 parallel=1 available=9680576512 required="864.9 MiB"
time=2025-06-08T10:55:45.995+02:00 level=INFO source=server.go:135 msg="system memory" total="16.0 GiB" free="9.7 GiB" free_swap="0 B"
time=2025-06-08T10:55:45.995+02:00 level=INFO source=server.go:168 msg=offload library=metal layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[9.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="864.9 MiB" memory.required.partial="864.9 MiB" memory.required.kv="24.0 MiB" memory.required.allocations="[864.9 MiB]" memory.weights.total="260.9 MiB" memory.weights.repeating="216.1 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="48.0 MiB" memory.graph.partial="48.0 MiB"
llama_model_load_from_file_impl: using device Metal (Apple M2 Pro) - 10922 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /Users/user/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  23:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:   51 tensors
llama_model_loader: - type  f16:   61 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 260.86 MiB (16.00 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 5
load: token to piece cache size = 0.2032 MB
print_info: arch             = nomic-bert
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 136.73 M
print_info: general.name     = nomic-embed-text-v1.5
print_info: vocab type       = WPM
print_info: n_vocab          = 30522
print_info: n_merges         = 0
print_info: BOS token        = 101 '[CLS]'
print_info: EOS token        = 102 '[SEP]'
print_info: UNK token        = 100 '[UNK]'
print_info: SEP token        = 102 '[SEP]'
print_info: PAD token        = 0 '[PAD]'
print_info: MASK token       = 103 '[MASK]'
print_info: LF token         = 0 '[PAD]'
print_info: EOG token        = 102 '[SEP]'
print_info: max token length = 21
llama_model_load: vocab only - skipping tensors
time=2025-06-08T10:55:46.027+02:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/opt/homebrew/Cellar/ollama/0.9.0/bin/ollama runner --model /Users/user/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 8192 --batch-size 512 --n-gpu-layers 13 --threads 8 --parallel 1 --port 60432"
time=2025-06-08T10:55:46.032+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=2
time=2025-06-08T10:55:46.032+02:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-08T10:55:46.033+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-06-08T10:55:46.050+02:00 level=INFO source=runner.go:815 msg="starting go runner"
time=2025-06-08T10:55:46.050+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-06-08T10:55:46.051+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:60432"
llama_model_load_from_file_impl: using device Metal (Apple M2 Pro) - 10922 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /Users/user/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  23:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:   51 tensors
llama_model_loader: - type  f16:   61 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 260.86 MiB (16.00 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 5
load: token to piece cache size = 0.2032 MB
print_info: arch             = nomic-bert
print_info: vocab_only       = 0
print_info: n_ctx_train      = 2048
print_info: n_embd           = 768
print_info: n_layer          = 12
print_info: n_head           = 12
print_info: n_head_kv        = 12
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 768
print_info: n_embd_v_gqa     = 768
print_info: f_norm_eps       = 1.0e-12
print_info: f_norm_rms_eps   = 0.0e+00
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 3072
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 0
print_info: pooling type     = 1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 2048
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 137M
print_info: model params     = 136.73 M
print_info: general.name     = nomic-embed-text-v1.5
print_info: vocab type       = WPM
print_info: n_vocab          = 30522
print_info: n_merges         = 0
print_info: BOS token        = 101 '[CLS]'
print_info: EOS token        = 102 '[SEP]'
print_info: UNK token        = 100 '[UNK]'
print_info: SEP token        = 102 '[SEP]'
print_info: PAD token        = 0 '[PAD]'
print_info: MASK token       = 103 '[MASK]'
print_info: LF token         = 0 '[PAD]'
print_info: EOG token        = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors:   CPU_Mapped model buffer size =    44.72 MiB
load_tensors: Metal_Mapped model buffer size =   216.15 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 0
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (8192) > n_ctx_train (2048) -- possible training context overflow
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_context:        CPU  output buffer size =     0.00 MiB
time=2025-06-08T10:55:46.286+02:00 level=INFO source=server.go:630 msg="llama runner started in 0.25 seconds"
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
[... 48 more identical lines ...]
[GIN] 2025/06/08 - 10:55:48 | 200 |  2.263241667s |   192.168.2.132 | POST     "/api/embed"

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.9.0

GiteaMirror added the bug label 2026-04-12 19:18:46 -05:00
Author
Owner

@oatmealm commented on GitHub (Jun 8, 2025):

Sorry, I see this was already reported in #10811.

Reference: github-starred/ollama#7264