[GH-ISSUE #7288] embedding generation failed. wsarecv: An existing connection was forcibly closed by the remote host. #66688

Open
opened 2026-05-04 07:48:34 -05:00 by GiteaMirror · 36 comments

Originally created by @viosay on GitHub (Oct 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7288

What is the issue?

Embedding model: when I submit a single fragment, it responds normally, but when I submit multiple fragments, an exception occurs.
I encountered this error on different Windows systems as well.
This issue occurs in both versions 0.3.14 and 0.4.0-rc3. However, I also tested versions 0.3.13 and 0.3.10, and they work perfectly.

[GIN] 2024/10/21 - 16:00:29 | 200 |    722.8624ms |   192.168.7.100 | POST     "/api/embed"
ggml.c:13343: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
ggml.c:13343: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
time=2024-10-21T16:00:36.434+08:00 level=ERROR source=routes.go:434 msg="embedding generation failed" error="do embedding request: Post \"http://127.0.0.1:64075/embedding\": read tcp 127.0.0.1:64078->127.0.0.1:64075: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2024/10/21 - 16:00:36 | 500 |    6.5660285s |   192.168.7.100 | POST     "/api/embed"
time=2024-10-21T16:01:00.723+08:00 level=INFO source=llama-server.go:72 msg="system memory" total="15.9 GiB" free="10.3 GiB" free_swap="8.8 GiB"
time=2024-10-21T16:01:00.726+08:00 level=INFO source=memory.go:346 msg="offload to cpu" layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[10.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="687.0 MiB" memory.required.partial="0 B" memory.required.kv="12.0 MiB" memory.required.allocations="[687.0 MiB]" memory.weights.total="589.2 MiB" memory.weights.repeating="548.0 MiB" memory.weights.nonrepeating="41.3 MiB" memory.graph.full="32.0 MiB" memory.graph.partial="32.0 MiB"
time=2024-10-21T16:01:00.730+08:00 level=INFO source=llama-server.go:355 msg="starting llama server" cmd="C:\\Users\\Administrator\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe --model C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-9e8e196fa3f73c32fb1b37503d5c28b166f4a96db54addd89927c47e4e40cf68 --ctx-size 2048 --batch-size 512 --embedding --threads 4 --no-mmap --parallel 1 --port 64090"
time=2024-10-21T16:01:00.782+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2024-10-21T16:01:00.791+08:00 level=INFO source=llama-server.go:534 msg="waiting for llama runner to start responding"
time=2024-10-21T16:01:00.792+08:00 level=INFO source=llama-server.go:568 msg="waiting for server to become available" status="llm server error"
time=2024-10-21T16:01:00.812+08:00 level=INFO source=runner.go:856 msg="starting go runner"
time=2024-10-21T16:01:00.829+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:64090"
llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from C:\Users\Administrator\.ollama\models\blobs\sha256-9e8e196fa3f73c32fb1b37503d5c28b166f4a96db54addd89927c47e4e40cf68
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = model
llama_model_loader: - kv   2:                           bert.block_count u32              = 24
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,21128]   = ["[PAD]", "[unused1]", "[unused2]", ...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,21128]   = [-1000.000000, -1000.000000, -1000.0...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,21128]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:  243 tensors
llama_model_loader: - type  f16:  146 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.0769 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 21128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 1024
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-012
llm_load_print_meta: f_norm_rms_eps   = 0.0e+000
llm_load_print_meta: f_clamp_kqv      = 0.0e+000
llm_load_print_meta: f_max_alibi_bias = 0.0e+000
llm_load_print_meta: f_logit_scale    = 0.0e+000
llm_load_print_meta: n_ff             = 4096
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 335M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 324.47 M
llm_load_print_meta: model size       = 619.50 MiB (16.02 BPW) 
llm_load_print_meta: general.name     = model
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
llm_load_print_meta: EOG token        = 102 '[SEP]'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.16 MiB
llm_load_tensors:        CPU buffer size =   619.50 MiB
time=2024-10-21T16:01:01.048+08:00 level=INFO source=llama-server.go:568 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    26.00 MiB
llama_new_context_with_model: graph nodes  = 851
llama_new_context_with_model: graph splits = 1
time=2024-10-21T16:01:01.299+08:00 level=INFO source=llama-server.go:573 msg="llama runner started in 0.51 seconds"
llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from C:\Users\Administrator\.ollama\models\blobs\sha256-9e8e196fa3f73c32fb1b37503d5c28b166f4a96db54addd89927c47e4e40cf68
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = model
llama_model_loader: - kv   2:                           bert.block_count u32              = 24
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,21128]   = ["[PAD]", "[unused1]", "[unused2]", ...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,21128]   = [-1000.000000, -1000.000000, -1000.0...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,21128]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:  243 tensors
llama_model_loader: - type  f16:  146 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.0769 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 21128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 1
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 324.47 M
llm_load_print_meta: model size       = 619.50 MiB (16.02 BPW) 
llm_load_print_meta: general.name     = model
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
llm_load_print_meta: EOG token        = 102 '[SEP]'
llm_load_print_meta: max token length = 48
llama_model_load: vocab only - skipping tensors
[GIN] 2024/10/21 - 16:01:01 | 200 |    701.8355ms |   192.168.7.100 | POST     "/api/embed"
ggml.c:13343: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
ggml.c:13343: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
time=2024-10-21T16:01:08.177+08:00 level=ERROR source=routes.go:434 msg="embedding generation failed" error="do embedding request: Post \"http://127.0.0.1:64090/embedding\": read tcp 127.0.0.1:64093->127.0.0.1:64090: wsarecv: An existing connection was forcibly closed by the remote host."

OS

Windows

GPU

No response

CPU

Intel

Ollama version

0.3.14~0.4.6

GiteaMirror added the bug label 2026-05-04 07:48:34 -05:00

@rick-github commented on GitHub (Oct 21, 2024):

Which model are you using?


@viosay commented on GitHub (Oct 21, 2024):

> Which model are you using?

I tried many embedding models, including 893379029/piccolo-large-zh-v2 and viosay/conan-embedding-v1, and they all have the same issue, although they worked perfectly fine before. However, a few models, like shaw/dmeta-embedding-zh, do not have this problem.


@rick-github commented on GitHub (Oct 21, 2024):

I am unable to replicate:

$ curl -s http://localhost:11434/api/version
{"version":"0.3.14"}
$ curl -s localhost:11434/api/embed -d '{"model":"viosay/conan-embedding-v1","input":"Why is the sky blue?"}' | jq '.embeddings=[.embeddings[]|length]'
{
  "model": "viosay/conan-embedding-v1",
  "embeddings": [
    1024
  ],
  "total_duration": 154636185,
  "load_duration": 2031329,
  "prompt_eval_count": 6
}
$ curl -s localhost:11434/api/embed -d '{"model":"viosay/conan-embedding-v1","input":["Why is the sky blue?","why is the grass green"]}' | jq '.embeddings=[.embeddings[]|length]'
{
  "model": "viosay/conan-embedding-v1",
  "embeddings": [
    1024,
    1024
  ],
  "total_duration": 306792338,
  "load_duration": 2301978,
  "prompt_eval_count": 13
}
$ curl -s localhost:11434/api/embed -d '{"model":"viosay/conan-embedding-v1","input":["Why is the sky blue?","why is the grass green","one","two","three"]}' | jq '.embeddings=[.embeddings[]|length]'
{
  "model": "viosay/conan-embedding-v1",
  "embeddings": [
    1024,
    1024,
    1024,
    1024,
    1024
  ],
  "total_duration": 628749251,
  "load_duration": 3847279,
  "prompt_eval_count": 16
}

It might have something to do with the client or the length of the inputs. Can you provide more context on your usage or, better yet, a script that demonstrates the problem?
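A minimal probe script along these lines (a sketch, not from the thread; it assumes a local Ollama on the default port and the model tag above) can narrow down the input length at which the failure starts, by growing the input until the server returns a 500:

```python
# Hypothetical repro probe: repeat a short phrase an increasing number of
# times and POST it to /api/embed until the request fails.
import json
import urllib.error
import urllib.request

MODEL = "viosay/conan-embedding-v1"  # model tag from this thread
URL = "http://localhost:11434/api/embed"

phrase = "Why is the sky blue? "
for repeats in (8, 16, 32, 64, 128, 256):
    payload = json.dumps({"model": MODEL, "input": phrase * repeats}).encode()
    req = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req) as resp:
            out = json.loads(resp.read())
        print(f"{repeats:4d} repeats: ok, prompt_eval_count={out.get('prompt_eval_count')}")
    except urllib.error.HTTPError as e:
        # A 500 here corresponds to the "embedding generation failed" log line.
        print(f"{repeats:4d} repeats: HTTP {e.code}: {e.read().decode(errors='replace')}")
        break
```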


@viosay commented on GitHub (Oct 22, 2024):

@rick-github You're right. Based on my tests, the issue is indeed related to the input length: once the input exceeds a certain length, an error occurs. The request below, for example, triggers the error.

curl -s localhost:11434/api/embed -d '{"model":"viosay/conan-embedding-v1","input":"The Spring AI project aims for complexity. This project draws inspiration from wellknown Python projects such as LangChain and LlamaIndex, but Spring AI is not a direct port of these projects. The belief behind the establishment of this project is that the next wave of generative AI applications will not only be suitable for Python developers, but will also be ubiquitous in many programming languages. The core of Spring AI is to provide abstraction as the foundation for developing AI applications. These abstractions have multiple implementations and can easily exchange components with minimal code changes. Spring AI Provide the following features: 1. Support all major model providers such as OpenAI, Microsoft, Amazon, Google, and Huggingface. 2. The supported model types include chat and text to image, and more types are currently under development. 3. A portable API across AI providers for chatting and embedding models. Supports synchronization and streaming API options. It also supports dropdown to access model specific features. 4. Map the output of the AI model to POJO. 5. Support all major vector database providers, such as Azure Vector Search, Chroma, Milvus, Neo4j, PostgreSQLPGVector, PineCone, Qdrant, Redis, and Weaviate6. Cross Vector Store providers portable API, including a novel metadata filter API similar to SQL, which is also portable. 7. Function calls 8. Spring Boot auto configuration and launcher for AI models and vector storage. 9. ETL framework for data engineering. With this feature set, you can implement common use cases such as document QA or chatting with documents. The concept section provides a high-level overview of AI concepts and their representation in Spring AI. The Introduction section explains how to create the first AI application. The subsequent sections will adopt a code centered approach to delve into each component and common use cases."}'

@rick-github commented on GitHub (Oct 22, 2024):

viosay/conan-embedding-v1 has an embedding length of 1024 and your test text is 1905 bytes, so it's exceeding the window. The client should chunk the text into segments smaller than the embedding length; otherwise the returned embeddings will be missing semantic content. However, ollama (or actually llama.cpp) should handle the situation more gracefully.
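For reference, a minimal client-side chunking sketch (not code from this thread; the 800-character cap is an arbitrary conservative margin, since character count is only a proxy for tokens):

```python
# Hypothetical helper: split a long text into fixed-size pieces and embed
# each piece separately, instead of sending one oversized input.
import json
import urllib.request

def embed_chunks(text: str, max_chars: int = 800,
                 model: str = "viosay/conan-embedding-v1") -> list:
    # Naive fixed-size split; a real client would cut on sentence boundaries.
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    payload = json.dumps({"model": model, "input": chunks}).encode()
    req = urllib.request.Request("http://localhost:11434/api/embed",
                                 data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embeddings"]  # one vector per chunk
```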


@viosay commented on GitHub (Oct 22, 2024):

> viosay/conan-embedding-v1 has an embedding length of 1024 and your test text is 1905 bytes, so it's exceeding the window. The client should chunk the text into segments smaller than the embedding length; otherwise the returned embeddings will be missing semantic content. However, ollama (or actually llama.cpp) should handle the situation more gracefully.

Thank you for your response! The key issue is that this problem did not exist in versions prior to Ollama 0.3.14. I had been using various models without any issues before that.


@viosay commented on GitHub (Oct 22, 2024):

For example, the shaw/dmeta-embedding-zh model, which has an embedding length of 768, does not encounter this issue in either the new or the old versions.


@rick-github commented on GitHub (Oct 22, 2024):

ollama moved to a more recent llama.cpp snapshot for the granite model support (https://github.com/ollama/ollama/commit/f2890a4494f9fb3722ee7a4c506252362d1eab65) and presumably that has introduced some problems with embedding calls. I don't see any recent issues regarding that in the llama.cpp issue tracker, so this is not affecting too many users.

Exceeding the length doesn't mean that all models will fail; some may be more resilient. I'll dig a bit more and file an issue with llama.cpp if this is the actual problem. In the meantime, you should adjust the text chunking anyway, as the embeddings will not contain all of the information in the original text.


@viosay commented on GitHub (Oct 22, 2024):

@rick-github Thank you very much! I will follow your advice. One more thing I noticed is that most of the models that encounter issues are those imported after being converted to GGUF using the convert_hf_to_gguf.py script from llama.cpp. I'm not sure if this is the cause of the problem.


@rick-github commented on GitHub (Oct 22, 2024):

Just to correct a mistake I made, viosay/conan-embedding-v1 has a limit of 512 tokens, and shaw/dmeta-embedding-zh a limit of 1024 tokens. The embedding length is the size of the generated embeddings.


@viosay commented on GitHub (Oct 22, 2024):

Yes, I understand the meaning of embedding length. Whether it’s 512 or 1024, they are both less than 1905. This is a puzzling issue, and as you mentioned, it seems to have arisen after Ollama updated llama.cpp. I'll continue testing and verifying the specific situation. Thank you!


@rick-github commented on GitHub (Oct 22, 2024):

Tokens are different to characters. A token is a sequence of characters, on average 2 or 3 characters in length. So a token length of 512 would handle 1024-1536 characters, and a token length of 1024 would handle 2048-3072 characters.
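Since the ratio varies with the tokenizer and the language (Chinese text, for instance, tokenizes very differently from English), one way to measure the real token count of a chunk is the prompt_eval_count field that /api/embed already returns, as in the outputs above. A small sketch (not from the thread; only safe for inputs short enough not to trigger the crash):

```python
# Hypothetical helper: ask the model itself how many tokens a text consumes.
import json
import urllib.request

def token_count(text: str, model: str = "viosay/conan-embedding-v1") -> int:
    payload = json.dumps({"model": model, "input": text}).encode()
    req = urllib.request.Request("http://localhost:11434/api/embed",
                                 data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # prompt_eval_count is the server's own count of tokens consumed.
        return json.loads(resp.read())["prompt_eval_count"]

text = "Why is the sky blue?"
print(len(text), "characters ->", token_count(text), "tokens")
```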


@viosay commented on GitHub (Oct 25, 2024):

I think I’ve figured out the issue; it seems that the truncate didn’t take effect.
[image: https://github.com/user-attachments/assets/1d1606c2-62a6-4ab6-b445-2ee6c95e4c99]


@rick-github commented on GitHub (Oct 25, 2024):

It does truncate; it's just that the runner throws a GGML_ASSERT(i01 >= 0 && i01 < ne01) failed exception and crashes when the number of tokens is close to the maximum allowed and the runner has been started with a context window greater than the actual supported value. If the model is loaded with the context size set to the actual supported context size, it works fine:

curl -s localhost:11434/api/embed -d '{"model":"viosay/conan-embedding-v1","options":{"num_ctx":512},"input":"The Spring AI project aims for complexity. This project draws inspiration from wellknown Python projects such as LangChain and LlamaIndex, but Spring AI is not a direct port of these projects. The belief behind the establishment of this project is that the next wave of generative AI applications will not only be suitable for Python developers, but will also be ubiquitous in many programming languages. The core of Spring AI is to provide abstraction as the foundation for developing AI applications. These abstractions have multiple implementations and can easily exchange components with minimal code changes. Spring AI Provide the following features: 1. Support all major model providers such as OpenAI, Microsoft, Amazon, Google, and Huggingface. 2. The supported model types include chat and text to image, and more types are currently under development. 3. A portable API across AI providers for chatting and embedding models. Supports synchronization and streaming API options. It also supports dropdown to access model specific features. 4. Map the output of the AI model to POJO. 5. Support all major vector database providers, such as Azure Vector Search, Chroma, Milvus, Neo4j, PostgreSQLPGVector, PineCone, Qdrant, Redis, and Weaviate6. Cross Vector Store providers portable API, including a novel metadata filter API similar to SQL, which is also portable. 7. Function calls 8. Spring Boot auto configuration and launcher for AI models and vector storage. 9. ETL framework for data engineering. With this feature set, you can implement common use cases such as document QA or chatting with documents. The concept section provides a high-level overview of AI concepts and their representation in Spring AI. The Introduction section explains how to create the first AI application. The subsequent sections will adopt a code centered approach to delve into each component and common use cases."}' | jq '.embeddings=[.embeddings[]|length]'

@mokby commented on GitHub (Oct 25, 2024):

@viosay Hi, I've hit the same error; do you have a solution? I tested many embedding models, but only mxbai-embed-large and nomic-embed-text work fine.


@viosay commented on GitHub (Oct 25, 2024):

> @viosay Hi, I've hit the same error; do you have a solution? I tested many embedding models, but only mxbai-embed-large and nomic-embed-text work fine.

There's no good solution for now. Either roll back to version 0.3.13, or try setting the actual context size as Rick suggested above. However, there's a risk of semantic loss in the returned embeddings. It might be best to first try splitting the text into chunks smaller than the embedding length on the client side, but due to differences in token calculation methods, your chunks may not match the model's segmentation precisely. For example, when I use the JTokkit library from Spring AI for token counting and segmentation, it oddly treats multiple consecutive dots as a single token, so the actual chunks end up much larger than expected, which still causes the error.


@zydmtaichi commented on GitHub (Oct 29, 2024):

Hi @viosay and @rick-github,
I ran into the same error and tried reducing the number of chunk tokens sent to the embedding API, but it doesn't seem to help. I use the model milkey/gte:large-zh-f16 with embedding_dim set to 1024 and chunk_token_size set to 1200 in the LightRAG framework, which accesses the embedding model via Ollama. The 500 internal error persists even when I reduce chunk_token_size to 400.


@rick-github commented on GitHub (Oct 29, 2024):

It works if I set num_ctx to 512. Perhaps the LightRAG framework is adding extra tokens, or there is an issue with chunk_token_size (https://github.com/HKUDS/LightRAG/issues/102).

$ ollama show milkey/gte:large-zh-f16
  Model                 
  	arch            	bert	  
  	parameters      	324M	  
  	quantization    	F16 	  
  	context length  	512 	  
  	embedding length	1024	  
  	                      

$ curl -s localhost:11434/api/embed -d '{"model":"milkey/gte:large-zh-f16","options":{"num_ctx":1200},"input":"The Spring AI project aims for complexity. This project draws inspiration from wellknown Python projects such as LangChain and LlamaIndex, but Spring AI is not a direct port of these projects. The belief behind the establishment of this project is that the next wave of generative AI applications will not only be suitable for Python developers, but will also be ubiquitous in many programming languages. The core of Spring AI is to provide abstraction as the foundation for developing AI applications. These abstractions have multiple implementations and can easily exchange components with minimal code changes. Spring AI Provide the following features: 1. Support all major model providers such as OpenAI, Microsoft, Amazon, Google, and Huggingface. 2. The supported model types include chat and text to image, and more types are currently under development. 3. A portable API across AI providers for chatting and embedding models. Supports synchronization and streaming API options. It also supports dropdown to access model specific features. 4. Map the output of the AI model to POJO. 5. Support all major vector database providers, such as Azure Vector Search, Chroma, Milvus, Neo4j, PostgreSQLPGVector, PineCone, Qdrant, Redis, and Weaviate6. Cross Vector Store providers portable API, including a novel metadata filter API similar to SQL, which is also portable. 7. Function calls 8. Spring Boot auto configuration and launcher for AI models and vector storage. 9. ETL framework for data engineering. With this feature set, you can implement common use cases such as document QA or chatting with documents. The concept section provides a high-level overview of AI concepts and their representation in Spring AI. The Introduction section explains how to create the first AI application. The subsequent sections will adopt a code centered approach to delve into each component and common use cases."}'
{"error":{}}
$ curl -s localhost:11434/api/embed -d '{"model":"milkey/gte:large-zh-f16","options":{"num_ctx":512},"input":"The Spring AI project aims for complexity. This project draws inspiration from wellknown Python projects such as LangChain and LlamaIndex, but Spring AI is not a direct port of these projects. The belief behind the establishment of this project is that the next wave of generative AI applications will not only be suitable for Python developers, but will also be ubiquitous in many programming languages. The core of Spring AI is to provide abstraction as the foundation for developing AI applications. These abstractions have multiple implementations and can easily exchange components with minimal code changes. Spring AI Provide the following features: 1. Support all major model providers such as OpenAI, Microsoft, Amazon, Google, and Huggingface. 2. The supported model types include chat and text to image, and more types are currently under development. 3. A portable API across AI providers for chatting and embedding models. Supports synchronization and streaming API options. It also supports dropdown to access model specific features. 4. Map the output of the AI model to POJO. 5. Support all major vector database providers, such as Azure Vector Search, Chroma, Milvus, Neo4j, PostgreSQLPGVector, PineCone, Qdrant, Redis, and Weaviate6. Cross Vector Store providers portable API, including a novel metadata filter API similar to SQL, which is also portable. 7. Function calls 8. Spring Boot auto configuration and launcher for AI models and vector storage. 9. ETL framework for data engineering. With this feature set, you can implement common use cases such as document QA or chatting with documents. The concept section provides a high-level overview of AI concepts and their representation in Spring AI. The Introduction section explains how to create the first AI application. The subsequent sections will adopt a code centered approach to delve into each component and common use cases."}' | jq '.embeddings=[.embeddings[]|length]'
{
  "model": "milkey/gte:large-zh-f16",
  "embeddings": [
    1024
  ],
  "total_duration": 5604443786,
  "load_duration": 5467363275,
  "prompt_eval_count": 512
}
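To avoid passing options by hand on every call, a client can look up the model's trained context length first and always send it as num_ctx. A sketch (not from this thread; it assumes /api/show exposes the GGUF keys seen in the logs above, such as bert.context_length, under model_info, as recent Ollama versions do):

```python
# Hypothetical wrapper: cap the runner's window at what the model supports.
import json
import urllib.request

BASE = "http://localhost:11434"

def _post(path: str, payload: dict) -> dict:
    req = urllib.request.Request(BASE + path,
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def trained_ctx(model: str) -> int:
    # e.g. general.architecture = "bert", bert.context_length = 512
    info = _post("/api/show", {"name": model})["model_info"]
    return info[info["general.architecture"] + ".context_length"]

def embed(model: str, inputs) -> dict:
    return _post("/api/embed", {"model": model, "input": inputs,
                                "options": {"num_ctx": trained_ctx(model)}})

print(embed("milkey/gte:large-zh-f16", "Why is the sky blue?")["prompt_eval_count"])
```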
Author
Owner

@mokby commented on GitHub (Oct 30, 2024):

> It works if I set num_ctx to 512. Perhaps the lightrag framework is adding extra tokens, or there is an issue with chunk_token_size.


Amazing! That works, thanks for your help!


@viosay commented on GitHub (Oct 30, 2024):

@mokby This is the suggestion from Rick that I mentioned above: set the actual context size. However, there is a risk of semantic loss in the returned embeddings, since the engine may silently truncate the input. The real root cause has yet to be identified; the most likely culprit is still token segmentation and counting.


@mokby commented on GitHub (Oct 30, 2024):

Yeah, that may be a potential problem. Could you share your solution if you manage to handle this issue? Many thanks.


@viosay commented on GitHub (Nov 16, 2024):

ChatGPT pointed out that the issue lies in the llama_server startup command, where ctx-size is set to 2048 when it should be 512. However, the command is identical to the one used in version 0.3.13.

cmd="C:\\Users\\Administrator\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe --model C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-9e8e196fa3f73c32fb1b37503d5c28b166f4a96db54addd89927c47e4e40cf68 --ctx-size 2048 --batch-size 512 --threads 6 --no-mmap --parallel 1 --port 7220"

@viosay commented on GitHub (Nov 27, 2024):

Here is a reproduction along with the debug logs captured with OLLAMA_DEBUG=1 enabled. I found that the content in the log output is inconsistent with the input text. Counting with OpenAI's tokenizer (https://platform.openai.com/tokenizer) shows that the input does not exceed the model's maximum supported token count of 512.

(Screenshots: https://github.com/user-attachments/assets/8f5aa78e-cbe3-4d6c-98fc-cf0ef32463ea, https://github.com/user-attachments/assets/75ee7519-aeab-440f-a4ef-1f0c4b759f69, https://github.com/user-attachments/assets/03058231-6b8b-4cff-899c-16270e0822d6)

curl -s localhost:11434/api/embed -d '{"input":[" ................................................................................................................. 78\n\n ......................................................................................................... 79\n\n ......................................................................................................... 80\n\n ................................................................................................................. 80\n\n ................................................................................................................. 80\n\n ................................................................................................................. 81\n\n\n\n .......................................................................... 81\n\n .................................................................81\n\n......................................................................................................... 82\n\n ............................................................................................................. 83\n\n .......................................................................................................... 84\n\n .......................................................................................................................85\n\n .............................................................................................................. 86\n\n .............................................................................................................. 89\n\n .......................................................................................................................90\n\n .......................................................................................................................93\n\n ...................................................................................96\n\n ............................................................................. 96\n\n .....................................................96\n\n ................................................................................... 101\n\n ................................................................................... 107\n\n ........................................................................................"],"model":"viosay/conan-embedding-v1","options":{}}'
time=2024-11-27T16:39:48.728+08:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model"
time=2024-11-27T16:39:48.728+08:00 level=DEBUG source=server.go:607 msg="model load progress 0.08"
time=2024-11-27T16:39:48.991+08:00 level=DEBUG source=server.go:607 msg="model load progress 0.19"
time=2024-11-27T16:39:49.254+08:00 level=DEBUG source=server.go:607 msg="model load progress 0.28"
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    26.00 MiB
llama_new_context_with_model: graph nodes  = 851
llama_new_context_with_model: graph splits = 1
time=2024-11-27T16:39:49.516+08:00 level=INFO source=server.go:601 msg="llama runner started in 1.05 seconds"
time=2024-11-27T16:39:49.516+08:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=C:\Users\Administrator\.ollama\models\blobs\sha256-9e8e196fa3f73c32fb1b37503d5c28b166f4a96db54addd89927c47e4e40cf68
time=2024-11-27T16:39:49.524+08:00 level=DEBUG source=server.go:965 msg="new runner detected, loading model for cgo tokenization"
llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from C:\Users\Administrator\.ollama\models\blobs\sha256-9e8e196fa3f73c32fb1b37503d5c28b166f4a96db54addd89927c47e4e40cf68 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = model
llama_model_loader: - kv   2:                           bert.block_count u32              = 24
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,21128]   = ["[PAD]", "[unused1]", "[unused2]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,21128]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,21128]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:  243 tensors
llama_model_loader: - type  f16:  146 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.0769 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 21128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 1
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 324.47 M
llm_load_print_meta: model size       = 619.50 MiB (16.02 BPW) 
llm_load_print_meta: general.name     = model
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
llm_load_print_meta: EOG token        = 102 '[SEP]'
llm_load_print_meta: max token length = 48
llama_model_load: vocab only - skipping tensors
time=2024-11-27T16:39:49.545+08:00 level=DEBUG source=runner.go:744 msg="embedding request" content=" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ."
time=2024-11-27T16:39:49.551+08:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=514 used=0 remaining=514
ggml.c:13343: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
ggml.c:13343: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed

time=2024-11-27T16:39:51.654+08:00 level=ERROR source=routes.go:453 msg="embedding generation failed" error="do embedding request: Post \"http://127.0.0.1:10390/embedding\": read tcp 127.0.0.1:10395->127.0.0.1:10390: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2024/11/27 - 16:39:51 | 500 |    3.2233719s |   192.168.7.247 | POST     "/api/embed"
time=2024-11-27T16:39:51.654+08:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-11-27T16:39:51.654+08:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Users\Administrator\.ollama\models\blobs\sha256-9e8e196fa3f73c32fb1b37503d5c28b166f4a96db54addd89927c47e4e40cf68 duration=5m0s
time=2024-11-27T16:39:51.654+08:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=C:\Users\Administrator\.ollama\models\blobs\sha256-9e8e196fa3f73c32fb1b37503d5c28b166f4a96db54addd89927c47e4e40cf68 refCount=0
time=2024-11-27T16:39:51.710+08:00 level=DEBUG source=server.go:423 msg="llama runner terminated" error="exit status 0xc0000409"

@viosay commented on GitHub (Nov 27, 2024):

@rick-github I hope the latest debug logs I provided help identify the issue. Thanks.
Additionally, after counting, I confirmed that the text in the previous example does not exceed the 512-token limit.

(Screenshots: https://github.com/user-attachments/assets/a838ff10-7edb-4a06-9ef9-3b16d5ad0244, https://github.com/user-attachments/assets/90f7b826-616e-44af-a419-3f42322d9e5b, https://github.com/user-attachments/assets/889eee6c-189e-4ff1-9d21-839b2944d668)


@rick-github commented on GitHub (Dec 23, 2024):

The tokenizer used by OpenAI is different from the tokenizer used by conan-embedding-v1. You can see from your screenshots that all three OpenAI models return a different token count for the same text. The prompt that you are using with OpenAI is not quite the same as the one you provided (1893 characters vs 1905 characters), so we need to knock off a couple of tokens, but conan-embedding-v1 creates 571 tokens.

(Screenshot: https://github.com/user-attachments/assets/728ebf56-77a8-47cd-a6ca-5f356d33f3ed)
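
To see how much tokenizers can disagree, here is a minimal counting sketch; the tiktoken encoding name is real, but the Hugging Face model id is an assumption used for illustration and is not taken from this thread:

```python
# Compare token counts for the same text under two different tokenizers.
# Assumes: pip install tiktoken transformers
import tiktoken
from transformers import AutoTokenizer

text = "The Spring AI project aims for complexity. " * 40  # stand-in long input

openai_enc = tiktoken.get_encoding("cl100k_base")  # OpenAI-style tokenizer
hf_tok = AutoTokenizer.from_pretrained("TencentBAC/Conan-embedding-v1")  # assumed id

print("cl100k_base tokens:", len(openai_enc.encode(text)))
print("conan-embedding-v1:", len(hf_tok.encode(text)))  # includes [CLS]/[SEP]
```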


@rick-github commented on GitHub (Jan 15, 2025):

Just to summarize the content from above:

The problem is that the context length that ollama is using is longer than the context length that the embedding model supports. If num_ctx is not supplied in the API call or the Modelfile, ollama will use a default context length of 2048. If this is longer than the context length of the model, a client can send a request longer than the model can accommodate, which can cause the runner to crash.

The models in the ollama library currently have the attributes in the table below. Models that are a crash risk with the default parameters are marked:

| model | model context_length | Modelfile num_ctx | effective num_ctx | crash |
|---|---|---|---|---|
| [nomic-embed-text](https://ollama.com/library/nomic-embed-text) | 2048 | 8192 | 2048 | |
| [mxbai-embed-large](https://ollama.com/library/mxbai-embed-large) | 512 | 512 | 512 | |
| [snowflake-arctic-embed](https://ollama.com/library/snowflake-arctic-embed) | 512 | - | 2048 | ✔ |
| [all-minilm](https://ollama.com/library/all-minilm) | 512 | 256 | 256 | |
| [bge-m3](https://ollama.com/library/bge-m3) | 8192 | - | 2048 | |
| [bge-large](https://ollama.com/library/bge-large) | 512 | - | 2048 | ✔ |
| [paraphrase-multilingual](https://ollama.com/library/paraphrase-multilingual) | 512 | 128 | 128 | |
| [snowflake-arctic-embed2](https://ollama.com/library/snowflake-arctic-embed2) | 8192 | - | 2048 | |
| [granite-embedding](https://ollama.com/library/granite-embedding) | 512 | - | 2048 | ✔ |

You can prevent these errors by setting num_ctx in the API call (e.g. "options":{"num_ctx":512}), or by modifying the model to specify the context length:

ollama cp bge-large:latest bge-large:original
ollama rm bge-large:latest
ollama show --modelfile bge-large:original > Modelfile
echo PARAMETER num_ctx 512 >> Modelfile
ollama create -f Modelfile bge-large:latest

Note that these errors occur because ollama is computing embeddings for text longer than the model supports. This means the text will be truncated and the embeddings will lose semantic content. The chunk size of the embedding client should be less than context_length.
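
A minimal client-side sketch of that advice, assuming Python's requests library, a local Ollama server, and a crude characters-per-token heuristic (use the model's real tokenizer for accurate counts):

```python
# Chunk text below the model's context length before calling /api/embed.
# Assumes: pip install requests; Ollama listening on localhost:11434.
import requests

CTX = 512             # model context length, per `ollama show`
MARGIN = 2            # leave room for the BOS/EOS tokens added by the runner
CHARS_PER_TOKEN = 2   # crude heuristic; conservative for CJK text

def chunks(text, max_tokens=CTX - MARGIN):
    step = max_tokens * CHARS_PER_TOKEN
    for i in range(0, len(text), step):
        yield text[i:i + step]

def embed(text, model="bge-large"):
    resp = requests.post(
        "http://localhost:11434/api/embed",
        json={
            "model": model,
            "input": list(chunks(text)),
            "options": {"num_ctx": CTX},  # cap the context to the model's limit
        },
    )
    resp.raise_for_status()
    return resp.json()["embeddings"]
```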


@rick-github commented on GitHub (Jan 15, 2025):

To dig a bit deeper: the root cause is a miscalculation in the truncation logic. The prompt is truncated to num_ctx at the entry point of the API, but further down the call tree BOS and EOS tokens are added, taking the input buffer to (say) 514 tokens rather than 512. There is more logic in the cache that tries to handle this, but it doesn't work when num_ctx >> context_length. Unfortunately, when the cache logic does kick in, it removes tokens from the start of the input, which is likely to hurt the usefulness of the embedding.
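
A toy illustration of that off-by-two, with hypothetical token counts mirroring the 514-vs-512 case in the logs (101 and 102 are the BOS/EOS ids from the model metadata above):

```python
# Illustrates the truncation miscalculation described above (toy numbers).
num_ctx = 512
prompt = list(range(600))      # stand-in for a 600-token prompt

truncated = prompt[:num_ctx]   # the API entry point truncates to num_ctx...
with_specials = [101] + truncated + [102]  # ...then BOS/EOS are added later

print(len(with_specials))             # 514
print(len(with_specials) <= num_ctx)  # False: overflows the 512-token buffer
```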


@viosay commented on GitHub (Feb 26, 2025):

Reconsidering this issue: when the exception occurs, the request input is completely normal, with characters properly segmented according to length, as shown in the example below. However, after reviewing the debug log, I found that spaces had been inserted between each character in the logged request content, which increases the number of tokens. Once the spaces are added, the length exceeds the limit, causing an overflow and an error at 1026 tokens.

curl -s 192.168.7.210:11434/api/embed -d '{"model":"shaw/dmeta-embedding-zh","input":"府可以从白皮书中借鉴到如何体系性规划,服务企业并促进经济建设的思\\路;工业企业可以了解头部企业在工业 APP 开发、应用、技术转化等方面\\n的经验;平台企业可以从白皮书中归纳分析出自己平台的发展方向与应用\\n推进路径。各主体都能够从本白皮书中找到如何让工业 APP 应用落地的抓\\n手,推进工业 APP 生态建设与发展。\\n\\nIX\\n目  录\\n编写说明 ..................................................................................................................... II\\n前  言 ................................................................................................................... VIII\\n目  录 ...................................................................................................................... IX\\n1 工业 APP 的概念与内涵 ................................................ 1\\n1.1 工业 APP 发展的背景 ..................................................................................... 1\\n1.2 工业 APP 的概念 ............................................................................................. 2\\n1.2.1 工业 APP 的定义 .............................................................................................. 2\\n1.2.2 工业 APP 的内涵 .............................................................................................. 4\\n1.2.3 工业 APP 的典型特征 ...................................................................................... 5\\n1.3 工业 APP 参考体系架构 ................................................................................. 7\\n1.4 概念辨析 ........................................................................................................... 9\\n1.4.1 工业 APP 与消费 APP 的区别 ....................................................................... 10\\n1.4.2 工业 APP 与工业软件的关系 ..........................................................................11"}'

debug log :

time=2025-02-26T17:31:48.140+08:00 level=DEBUG source=runner.go:742 msg="embedding request" content=" 府 可 以 从 白 皮 书 中 借 鉴 到 如 何 体 系 性 规 划 , 服 务 企 业 并 促 进 经 济 建 设 的 思 \\ 路 ; 工 业 企 业 可 以 了 解 头 部 企 业 在 工 业 app 开 发 、 应 用 、 技 术 转 化 等 方 面 \\ n 的 经 验 ; 平 台 企 业 可 以 从 白 皮 书 中 归 纳 分 析 出 自 己 平 台 的 发 展 方 向 与 应 用 \\ n 推 进 路 径 。 各 主 体 都 能 够 从 本 白 皮 书 中 找 到 如 何 让 工 业 app 应 用 落 地 的 抓 \\ n 手 , 推 进 工 业 app 生 态 建 设 与 发 展 。 \\ n \\ nix \\ n 目 录 \\ n 编 写 说 明 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii \\ n 前 言 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii \\ n 目 录 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix \\ n1 工 业 app 的 概 念 与 内 涵 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 \\ n1 . 1 工 业 app 发 展 的 背 景 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 \\ n1 . 2 工 业 app 的 概 念 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 \\ n1 . 2 . 1 工 业 app 的 定 义 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 \\ n1 . 2 . 2 工 业 app 的 内 涵 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 \\ n1 . 2 . 3 工 业 app 的 典 型 特 征 . . . . . . . ."
time=2025-02-26T17:31:48.141+08:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=1026 used=0 remaining=1026
C:/a/ollama/ollama/ml/backend/ggml/ggml/src/ggml-cpu/ggml-cpu.c:8374: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
C:/a/ollama/ollama/ml/backend/ggml/ggml/src/ggml-cpu/ggml-cpu.c:8374: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
time=2025-02-26T17:31:50.491+08:00 level=ERROR source=routes.go:478 msg="embedding generation failed" error="do embedding request: Post \"http://127.0.0.1:57292/embedding\": read tcp 127.0.0.1:57295->127.0.0.1:57292: wsarecv: An existing connection was forcibly closed by the remote host."

What I want to know is why spaces are automatically added between each character.
This spacing issue is also visible in the earlier replies: https://github.com/ollama/ollama/issues/7288#issuecomment-2503273262


@rick-github commented on GitHub (Feb 26, 2025):

The data in the log are characters, not tokens. The padding is a function of the tokenizer table in shaw/dmeta-embedding-zh. The tokenizer uses sentencepiece, and spaces are represented internally as the special character "▁". In the process of truncating the input to make it fit in the context buffer, the input is tokenized and then detokenized, and the latter step produces the padding. However, the number of tokens is the same. For example, tokenizing "府" (the first glyph from your test input) returns the value 2424. De-tokenizing the value 2424 returns the glyph sequence " 府". These are both the same as far as the tokenizer is concerned:

$ diff -u  <(curl -s localhost:11434/api/embed -d '{"model":"shaw/dmeta-embedding-zh","input":"府"}' | jq) <(curl -s localhost:11434/api/embed -d '{"model":"shaw/dmeta-embedding-zh","input":" 府"}' | jq)
--- /dev/fd/63	2025-02-26 15:51:31.545786473 +0100
+++ /dev/fd/62	2025-02-26 15:51:31.545786473 +0100
@@ -772,7 +772,7 @@
       -0.007145582
     ]
   ],
-  "total_duration": 211954403,
-  "load_duration": 198647809,
+  "total_duration": 219156306,
+  "load_duration": 203917779,
   "prompt_eval_count": 1
 }
$ docker compose logs ollama | grep -A2 "embedding request"
ollama  | time=2025-02-26T14:55:08.161Z level=DEBUG source=runner.go:742 msg="embedding request" content=" 府"
ollama  | time=2025-02-26T14:55:08.161Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=3 prompt=3 used=0 remaining=3
ollama  | time=2025-02-26T14:55:08.180Z level=DEBUG source=sched.go:408 msg="context for request finished"
--
ollama  | time=2025-02-26T14:55:08.180Z level=DEBUG source=runner.go:742 msg="embedding request" content=府
ollama  | time=2025-02-26T14:55:08.180Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=3 prompt=3 used=0 remaining=3
ollama  | [GIN] 2025/02/26 - 14:55:08 | 200 |  228.520826ms |      172.19.0.1 | POST     "/api/embed"
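
The round trip can be reproduced directly with the sentencepiece library; a small sketch, where tokenizer.model is a placeholder path for the model's actual SentencePiece file:

```python
# decode(encode(x)) may differ textually while the token ids stay identical.
# Assumes: pip install sentencepiece; tokenizer.model is a placeholder path.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

ids = sp.encode("府")        # e.g. [2424]
roundtrip = sp.decode(ids)   # may come back as " 府" (the "▁" space marker)

print(ids, repr(roundtrip))
print(sp.encode(roundtrip) == ids)  # expected True: same tokens either way
```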

@viosay commented on GitHub (Feb 27, 2025):

@rick-github Thank you, I think I understand now.


@leodeslf commented on GitHub (Mar 24, 2025):

Just in case...

It happened to me. My specific problem was that the default value for num_ctx in llamaindex exceeded that of the model I was trying to use.

// E.g.:
Settings.embedModel = new OllamaEmbedding({
  model: 'granite-embedding:278m',
  options: {
    num_ctx: 512, // <-- This fixed my issue.
  },
});

If you are here, you probably want to check whether there are conflicts between the defaults of whatever tool you are using and those of your specific model.


@rick-github commented on GitHub (Mar 24, 2025):

Now that the switch to 0.6 has happened and the new runner architecture looks reasonably stable, I will look at creating a PR to fix this.


@wikty commented on GitHub (Apr 13, 2025):

I'm the maintainer of dmeta-embedding-zh. The compatibility issue with Ollama 0.6.x has been fixed; please re-download the model:

```
ollama rm shaw/dmeta-embedding-zh
ollama pull shaw/dmeta-embedding-zh
```

@ynott commented on GitHub (Apr 7, 2026):

**Adding a data point for posterity (low priority)**

Hit what looks like the same family of bug on Ollama v0.20.2 with `jeffh/intfloat-multilingual-e5-small:q8_0` (BERT/XLM-R, GGUF Q8_0). Not asking for action, just leaving a record in case someone else lands here from a search.

## Symptom

`POST /api/embed` (and `/api/embeddings`) crashes the runner with:

```
/ml/backend/ggml/ggml/src/ggml-cpu/ops.cpp:4625: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
signal: aborted (core dumped)
```

The runner is started with `--ollama-engine` (new engine path).

## It's input-dependent, not "Japanese breaks it"

I wrote a [small repro script (gist)](https://gist.github.com/ynott/91f80510a3ae31d84fe1be35cc1747d6) that runs 17 single-string inputs against `/api/embed`. Same model, same endpoint, no batching (a condensed sketch of the same loop follows the table):

```
Label                          | Input                     | Result
-------------------------------|---------------------------|-------
ASCII single char              | .                         | ok
ASCII two chars                | ab                        | ok
ASCII word                     | est                       | ok
Single katakana 'te'           | テ                        | ok
Single katakana 'su'           | ス                        | ok
Katakana 'suta'                | スタ                      | ok
Katakana 'tesu'                | テス                      | CRASH
Katakana 'kana'                | カナ                      | CRASH
Single hiragana 'a'            | あ                        | ok
Hiragana 'ai'                  | あい                      | ok
Kanji two chars 'nihon'        | 日本                      | ok
Kanji three 'nihongo'          | 日本語                    | ok
Ideographic full stop          | 。                        | CRASH
Ideographic comma              | 、                        | CRASH
Japanese sentence              | テスト文章です。          | CRASH
e5 prefixed sentence           | query: テスト文章です。   | CRASH
Spaced ideographs              | 日 本 語                  | ok
```
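For anyone who prefers not to run the gist, a condensed TypeScript (Node 18+) sketch of the same loop; the model name and inputs come from the table above, everything else is illustrative:

```ts
// POST each input to /api/embed and report the outcome. A crashed runner
// typically surfaces client-side as a 500 or a dropped connection.
const model = "jeffh/intfloat-multilingual-e5-small:q8_0";
const inputs = [".", "ab", "テ", "ス", "スタ", "テス", "カナ", "。", "、", "日本語"];

for (const input of inputs) {
  try {
    const res = await fetch("http://127.0.0.1:11434/api/embed", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, input }),
    });
    console.log(input, "->", res.ok ? "ok" : `error ${res.status}`);
  } catch {
    console.log(input, "-> crash (connection dropped)");
  }
}
```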

So it's not "multibyte input", input length, or leading whitespace. Single-character katakana like `ス` and `テ` work fine, but the 2-character `テス` and `カナ` crash. Single kanji and hiragana of any length I tested work. Inputs containing `。` or `、` always crash.

Best guess: SentencePiece is producing certain merged-token IDs (e.g. `▁テス`, `▁カナ`, `▁。`) that fall outside the valid embedding-table index range under the new engine, while single-character tokens or other merge paths stay within range. Notably, `ollama run "テスト文章です。"` works fine with the same model; only `/api/embed` crashes, which suggests the issue is in the embedding-specific code path rather than the tokenizer in general.
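For context on what the assert checks: in GGML, `i01` is a row index into a tensor and `ne01` the number of rows, so during an embedding lookup an out-of-range token ID trips exactly this condition. An illustrative TypeScript sketch of the invariant (not GGML source; all names here are mine):

```ts
// The failing assert amounts to a row-bounds check on the embedding table:
// every token ID must index a valid row of a [vocabSize x dim] matrix.
function lookupEmbedding(
  table: Float32Array, // flattened vocabSize x dim weight matrix
  vocabSize: number,   // plays the role of ne01
  dim: number,
  tokenId: number      // plays the role of i01
): Float32Array {
  if (tokenId < 0 || tokenId >= vocabSize) {
    // GGML_ASSERT(i01 >= 0 && i01 < ne01) enforces this and aborts on failure.
    throw new RangeError(`token id ${tokenId} outside [0, ${vocabSize})`);
  }
  return table.subarray(tokenId * dim, (tokenId + 1) * dim);
}
```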

## Same script against `bge-m3` on the same Ollama instance

```
Total tests : 17
OK          : 17
Crashed     : 0
```

All inputs returned embeddings successfully. This model is not affected.

So the issue is specific to this particular model + new engine combination, not the embedding code path as a whole.

## Things that did NOT help

- `num_ctx: 512` via Modelfile (the workaround that helped in #8431)
- `truncate: false`, `keep_alive: 0`, `options.num_ctx: 512` in the request body
- `OLLAMA_NEW_ENGINE=false` env var (no longer respected in v0.20.2; the runner still launches with `--ollama-engine`)
- Using the legacy `/api/embeddings` endpoint instead of `/api/embed`

## What did help

Switching to `bge-m3` (official Ollama library). Same endpoint, same Ollama version, same hardware, same exact inputs: clean embeddings every time, including batched requests.

## Environment

- Ollama v0.20.2, `--ollama-engine` runner
- Linux, NVIDIA RTX 2070, driver 580.126.20, CUDA 13
- Affected model: `jeffh/intfloat-multilingual-e5-small:q8_0` (architecture: bert, n_ctx_train: 512, embedding length: 384, Q8_0)
- Working model: `bge-m3`

Cross-referencing #7288, since this looks like part of the same broader cluster of `/api/embed` + `GGML_ASSERT(i01 >= 0 && i01 < ne01)` regressions reported here over multiple versions.

**Workaround for anyone hitting this from a search: try a different embedding model (e.g. `bge-m3`) before deeper debugging.**


@joaquinariasco-lab commented on GitHub (Apr 17, 2026):

Can you share a minimal reproducible example, including: the exact "multiple fragments" payload you send (array vs. concatenated string), the approximate token/character length of each fragment, the model and version used, and your truncate / context-length settings? That would help determine whether the GGML_ASSERT failure is triggered by batching behavior or by exceeding the model's embedding window.


@ynott commented on GitHub (Apr 23, 2026):

@joaquinariasco-lab Thank you for the follow-up.

Environment:

- Ollama: 0.20.2
- Model: jeffh/intfloat-multilingual-e5-small:q8_0
- Endpoint: /api/embed
- Linux

Below is the simplest command to reproduce the issue:

```bash
$ ollama pull jeffh/intfloat-multilingual-e5-small:q8_0
$ curl -sS http://127.0.0.1:11434/api/embed \
  -H 'Content-Type: application/json' \
  -d '{"model":"jeffh/intfloat-multilingual-e5-small:q8_0","input":"テス"}'
```

This fails on my side.
For comparison, this succeeds:

```bash
$ curl -sS http://127.0.0.1:11434/api/embed \
  -H 'Content-Type: application/json' \
  -d '{"model":"jeffh/intfloat-multilingual-e5-small:q8_0","input":"テ"}'
```

I also tested a multiple-fragment payload (failed):

```bash
$ curl -sS http://127.0.0.1:11434/api/embed \
  -H 'Content-Type: application/json' \
  -d '{"model":"jeffh/intfloat-multilingual-e5-small:q8_0","input":["テス","カナ"]}'
```

I also tested the same inputs with the Ollama CLI:

```bash
$ ollama run jeffh/intfloat-multilingual-e5-small:q8_0 "テス"    # failed
$ ollama run jeffh/intfloat-multilingual-e5-small:q8_0 "テ"      # success
$ ollama run jeffh/intfloat-multilingual-e5-small:q8_0 "テスト"  # success
```

Observed results:

- ["テス","カナ"] -> FAIL
- "テス" -> FAIL
- "カナ" -> FAIL
- "テ" -> OK
- "ス" -> OK
- "テスト" -> OK

Approximate lengths:

  • "テス": 2 chars / 6 bytes
  • "カナ": 2 chars / 6 bytes

This reproduces with a multiple-fragment payload, but also with very short single-string inputs, so it does not appear to be specific to batching, nor to be caused by exceeding the embedding window.

I have uploaded the debug log here:
https://gist.github.com/ynott/8a9ad624e8aae8dfcb8221a343e4030b

Relevant details from the failing "テス" case:

- Loading cache slot ... prompt=3
- GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
- runner.num_ctx=4096
- signal: aborted (core dumped)
Reference: github-starred/ollama#66688