Mirror of https://github.com/ollama/ollama.git (synced 2026-05-07 00:22:43 -05:00)
[GH-ISSUE #7288] embedding generation failed. wsarecv: An existing connection was forcibly closed by the remote host. #51143
Open · opened 2026-04-28 18:32:21 -05:00 by GiteaMirror · 36 comments
Originally created by @viosay on GitHub (Oct 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7288
What is the issue?
embedding model
When I submit a single fragment, it responds normally, but when I submit multiple fragments, an exception occurs.
I encountered this error on different Windows systems as well.
This issue occurs in both versions 0.3.14 and 0.4.0-rc3. However, I also tested versions 0.3.13 and 0.3.10, and they work perfectly.
OS
Windows
GPU
No response
CPU
Intel
Ollama version
0.3.14~0.4.6
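As a rough illustration of the failure mode described above (a sketch only; the model name and fragment text are placeholders, not the reporter's actual payload), a single-fragment versus multi-fragment request against /api/embed looks like this:

```python
# Sketch: compare a single-fragment and a multi-fragment embedding request.
# Model name and inputs are placeholders; substitute an affected embedding model.
import requests

OLLAMA_EMBED = "http://localhost:11434/api/embed"
MODEL = "viosay/conan-embedding-v1"  # placeholder

# Single fragment: reported to respond normally.
single = requests.post(OLLAMA_EMBED, json={"model": MODEL, "input": "fragment one"})
print(single.status_code)

# Multiple fragments in one request: reported to fail on 0.3.14 and later.
multi = requests.post(
    OLLAMA_EMBED,
    json={"model": MODEL, "input": ["fragment one", "fragment two", "fragment three"]},
)
print(multi.status_code, multi.text[:200])
```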
@rick-github commented on GitHub (Oct 21, 2024):
Which model are you using?
@viosay commented on GitHub (Oct 21, 2024):
I tried many embedding models, including 893379029/piccolo-large-zh-v2 and viosay/conan-embedding-v1, and they all have the same issue, although they worked perfectly fine before. However, a few models, like shaw/dmeta-embedding-zh, do not have this problem.
@rick-github commented on GitHub (Oct 21, 2024):
I am unable to replicate:
It might have something to do with the client or the length of the inputs. Can you provide more context on your usage, or better yet, a script that demonstrates the problem?
@viosay commented on GitHub (Oct 22, 2024):
@rick-github You're right. Based on my tests, the issue is indeed related to the input length. When it exceeds a certain length, an error occurs.
It’s like the example below, where an error occurred.
@rick-github commented on GitHub (Oct 22, 2024):
viosay/conan-embedding-v1 has an embedding length of 1024 and your test text is 1905 bytes, so it's exceeding the window. The client should chunk the text to segments smaller than the embedding length otherwise the returned embeddings will be missing semantic content. However, ollama (or actually llama.cpp) should handle the situation more gracefully.
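A minimal sketch of the client-side chunking suggested here, assuming a simple character-based split (the chunk size below is an illustrative stand-in for a per-model token budget, not a value taken from any specific model):

```python
# Sketch: split long text into chunks client-side before requesting embeddings.
# The character budget is a crude stand-in for a real token budget.
import requests

OLLAMA_EMBED = "http://localhost:11434/api/embed"
MODEL = "shaw/dmeta-embedding-zh"  # placeholder

def chunk_text(text: str, max_chars: int = 800) -> list[str]:
    """Split text into pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def embed_long_text(text: str) -> list[list[float]]:
    chunks = chunk_text(text)
    resp = requests.post(OLLAMA_EMBED, json={"model": MODEL, "input": chunks})
    resp.raise_for_status()
    return resp.json()["embeddings"]  # one vector per chunk
```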
@viosay commented on GitHub (Oct 22, 2024):
Thank you for your response! The key issue is that this problem did not exist in versions prior to Ollama 0.3.14. I had been using various models without any issues before that.
@viosay commented on GitHub (Oct 22, 2024):
For example, the shaw/dmeta-embedding-zh model, which has an embedding length of 768, does not encounter this issue. Both the new and old versions do not have this issue.
@rick-github commented on GitHub (Oct 22, 2024):
ollama moved to a more recent llama.cpp snapshot for the granite model support (f2890a4494) and presumably that has introduced some problems with embedding calls. I don't see any recent issues regarding that in the llama.cpp issue tracker, so this is not affecting too many users. Exceeding the length doesn't mean that all models will fail, some may be more resilient. I'll dig a bit more and file an issue with llama.cpp if this is the actual problem. In the meantime, you should adjust the text chunking anyway, as the embeddings will not contain all of the information in the original text.
@viosay commented on GitHub (Oct 22, 2024):
@rick-github Thank you very much! I will follow your advice. One more thing I noticed is that most of the models that encounter issues are those imported after being converted to GGUF using the convert_hf_to_gguf.py script from llama.cpp. I'm not sure if this is the cause of the problem.
@rick-github commented on GitHub (Oct 22, 2024):
Just to correct a mistake I made, viosay/conan-embedding-v1 has a limit of 512 tokens, and shaw/dmeta-embedding-zh a limit of 1024 tokens. The embedding length is the size of the generated embeddings.
@viosay commented on GitHub (Oct 22, 2024):
Yes, I understand the meaning of embedding length. Whether it’s 512 or 1024, they are both less than 1905. This is a puzzling issue, and as you mentioned, it seems to have arisen after Ollama updated llama.cpp. I'll continue testing and verifying the specific situation. Thank you!
@rick-github commented on GitHub (Oct 22, 2024):
Tokens are different to characters. A token is a sequence of characters, on average 2 or 3 characters in length. So a token length of 512 would handle 1024-1536 characters, and a token length of 1024 would handle 2048-3072 characters.
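Turning that average into a conservative character budget is straightforward; the helper below assumes the low end of the 2-3 characters-per-token range, which is only a heuristic and no substitute for counting with the model's actual tokenizer:

```python
# Heuristic: convert a model's token limit into a conservative character budget,
# assuming roughly 2 characters per token (the low end of the range above).
def max_chars_for(token_limit: int, chars_per_token: float = 2.0) -> int:
    return int(token_limit * chars_per_token)

print(max_chars_for(512))   # 1024 characters for a 512-token model
print(max_chars_for(1024))  # 2048 characters for a 1024-token model
```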
@viosay commented on GitHub (Oct 25, 2024):
I think I've figured out the issue; it seems that the truncate didn't take effect.
@rick-github commented on GitHub (Oct 25, 2024):
It does truncate, it's just that the runner throws a GGML_ASSERT(i01 >= 0 && i01 < ne01) failed exception and crashes when the number of tokens is close to the maximum allowed and the runner has been started with a context window greater than the actual supported value. If the model is loaded with the context size set to the actual supported context size, it works fine:
@mokby commented on GitHub (Oct 25, 2024):
@viosay Hi, I met the same error, do you have any solution to solve it? I tested many embedding models, but only mxbai-embed-large and nomic-embed-text work fine.
@viosay commented on GitHub (Oct 25, 2024):
There’s no good solution for now. Either roll back to version 0.3.13, or try setting the actual context size as Rick suggested above. However, there’s a risk of semantic loss in the returned embeddings. It might be best to first try splitting the text into chunks smaller than the embedding length on the client side, but due to differences in token calculation methods, your chunks may not match the model’s segmentation precisely. For example, when I use the jtokkit library from SpringAI for token calculation and segmentation, it oddly treats multiple consecutive dots as a single token. This results in the actual chunks being much larger than expected, which still causes the error.
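For reference, the "set the actual context size" workaround mentioned above can be applied per request by passing num_ctx in the options; a sketch (512 matches the conan-embedding-v1 limit discussed earlier, substitute your own model's limit):

```python
# Sketch: pass the model's real context size as num_ctx in the request options.
import requests

resp = requests.post(
    "http://localhost:11434/api/embed",
    json={
        "model": "viosay/conan-embedding-v1",  # placeholder
        "input": "text to embed",
        "options": {"num_ctx": 512},  # the model's actual supported context length
    },
)
resp.raise_for_status()
print(len(resp.json()["embeddings"][0]))
```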
@zydmtaichi commented on GitHub (Oct 29, 2024):
Hi @viosay and @rick-github ,
I met the same error and tried to reduce the length of chunk tokens sent to the embedding API, but it doesn't seem to work. I use the model milkey/gte:large-zh-f16 and set embedding_dim to 1024 with chunk_token_size equal to 1200 in the lightrag framework, which accesses the embedding model via ollama. The 500 internal error does not change even if I reduce chunk_token_size to 400.
@rick-github commented on GitHub (Oct 29, 2024):
It works if I set num_ctx to 512. Perhaps the lightrag framework is adding extra tokens, or there is an issue with chunk_token_size.
@mokby commented on GitHub (Oct 30, 2024):
Amazing! That works, thanks for your help!
@viosay commented on GitHub (Oct 30, 2024):
@mokby This is what I mentioned above about Rick's suggestion to set the actual context size. However, there is a risk of semantic loss in the returned embeddings. The engine might enforce input truncation, and the real reason has yet to be identified. The most likely cause is still issues with token segmentation and calculation.
@mokby commented on GitHub (Oct 30, 2024):
Yeah, that may be a potential problem. Can you share your solution if you manage to handle this issue? Many thanks.
@viosay commented on GitHub (Nov 16, 2024):
ChatGPT pointed out that the issue lies in the llama_server startup command, where ctx-size was set to 2048 but should actually be 512. However, compared to version 0.3.13, these settings are the same.
@viosay commented on GitHub (Nov 27, 2024):
Providing a reproduction along with the debug logs with OLLAMA_DEBUG=1 enabled. I found that the content in the log output is inconsistent with the input text.
Using OpenAI's tokenizer, the calculation shows that the input tokens do not exceed the model's maximum supported token count of 512.
@viosay commented on GitHub (Nov 27, 2024):
@rick-github I hope the latest debugging logs I provide will help identify the issue. Thanks.



Additionally, after calculation, it was confirmed that the text in the previous example does not exceed the 512-token limit.
@rick-github commented on GitHub (Dec 23, 2024):
The tokenizer used by OpenAI is different to the tokenizer used by conan-embedding-v1. You can see from your screenshots that all three OpenAI models return a different token count for the same text. The prompt that you are using with OpenAI is not quite the same as the one you provided (1893 characters vs 1905 characters) so we need to knock off a couple of tokens, but conan-embedding-v1 creates 571 tokens.
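To count tokens the way the embedding model itself does rather than with OpenAI's tokenizer, the model's original Hugging Face tokenizer can be loaded; a sketch, with the repository id left as a placeholder (use the model's real upstream repo):

```python
# Sketch: count tokens with the embedding model's own tokenizer.
from transformers import AutoTokenizer

# Placeholder repo id; substitute the embedding model's actual upstream repository.
tok = AutoTokenizer.from_pretrained("<upstream-org>/conan-embedding-v1")
text = open("prompt.txt", encoding="utf-8").read()
print(len(tok(text)["input_ids"]))  # token count as this model sees it
```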
@rick-github commented on GitHub (Jan 15, 2025):
Just to summarize the content from above:
The problem is that the context length that ollama is using is longer than the context length that the embedding model supports. If num_ctx is not supplied in the API call or the Modelfile, ollama will use a default context length of 2048. If this is longer than the context length of the model, a client can send a request longer than the model can accommodate, which can cause the runner to crash. The models in the ollama library currently have the attributes in the table below; models that are a crash risk with the default parameters are marked.
You can prevent these errors by setting num_ctx in the API call (eg "options":{"num_ctx":512}), or modifying the model to specify the context length. Note that the reason the errors are occurring is because ollama is getting embeddings for text lengths greater than that supported by the model. This means that the text will be truncated and the embeddings will be losing semantic content. The chunk size of the embedding client should be less than context_length.
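To find the context length a given model actually supports (rather than guessing), the /api/show endpoint can be queried; a sketch, noting that the context-length entry in model_info is prefixed with the architecture name (for example bert.context_length), so the exact key varies by model:

```python
# Sketch: look up a model's trained context length via /api/show, then use it as num_ctx.
import requests

resp = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "shaw/dmeta-embedding-zh"},  # placeholder
)
resp.raise_for_status()
info = resp.json().get("model_info", {})
ctx = {k: v for k, v in info.items() if k.endswith(".context_length")}
print(ctx)  # e.g. {"bert.context_length": 1024}; pass this value as num_ctx
```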
@rick-github commented on GitHub (Jan 15, 2025):
To dig a bit deeper: the root cause is a mis-calculation in the truncation logic. The prompt is truncated to num_ctx at the entry point of the API, but further down the call tree BOS and EOS tokens are added, taking the input buffer to (say) 514 tokens rather than 512. There's more logic in the cache that tries to handle this but doesn't work when num_ctx >> context_length. Unfortunately, when the cache logic does kick in, it removes tokens from the start of the input, which is likely to impact the usefulness of the embedding.
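A purely illustrative sketch of that off-by-BOS/EOS overflow (not ollama's actual code): truncating to num_ctx first and adding the special tokens afterwards leaves the buffer two tokens over the window the runner allocated.

```python
# Illustrative only: truncate to num_ctx, then add BOS/EOS, and the buffer overflows.
NUM_CTX = 512
BOS, EOS = 0, 2                           # stand-in special token ids

prompt_tokens = list(range(600))          # pretend tokenizer output, 600 tokens
truncated = prompt_tokens[:NUM_CTX]       # truncation at the API entry point
with_specials = [BOS] + truncated + [EOS] # special tokens added further down the call tree
print(len(with_specials))                 # 514 > 512: overruns the allocated context
```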
@viosay commented on GitHub (Feb 26, 2025):
Reconsidering this issue: when the exception occurs, the request input is completely normal, with characters properly segmented according to length, as shown in the example below. However, after reviewing the debug log, I found that spaces were added between each character in the outgoing request content, which increased the number of tokens. The length exceeds the limit after the spaces are added, causing an overflow and resulting in an error at 1026.
debug log:
What I want to know is why spaces are automatically added between each character.
Actually, this issue with spaces has also been reflected in the previous replies: https://github.com/ollama/ollama/issues/7288#issuecomment-2503273262
@rick-github commented on GitHub (Feb 26, 2025):
The data in the log are characters, not tokens. The padding is a function of the tokenizer table in shaw/dmeta-embedding-zh. The tokenizer uses sentencepiece and the spaces are represented internally as the special character "▁". In the process of truncating the input to make it fit in the context buffer, the input is tokenized and then detokenized, and the latter step results in the padding. However, the number of tokens is the same. For example, tokenizing "府" (the first glyph from your test input) returns a value of 2424. De-tokenizing the value of 2424 returns the glyph sequence " 府". These are both the same as far as the tokenizer is concerned:
@viosay commented on GitHub (Feb 27, 2025):
@rick-github Thank you, I think I understand now.
@leodeslf commented on GitHub (Mar 24, 2025):
Just in case...
It happened to me. My specific problem was that the default value for num_ctx in llamaindex exceeded that of the model I was trying to use.
If you are here, you probably want to check whether there are conflicts between the defaults of whatever tool you are using and those of your specific model.
@rick-github commented on GitHub (Mar 24, 2025):
Now that the switch to 0.6 has happened and the new runner architecture looks a bit stable, I will look at creating a PR to fix this.
@wikty commented on GitHub (Apr 13, 2025):
I'm the maintainer of dmeta-embedding-zh. The compatibility issue with ollama 0.6.x has been fixed; please re-download the model:
@ynott commented on GitHub (Apr 7, 2026):
Adding a data point for posterity (low priority).
Hit what looks like the same family of bug on Ollama v0.20.2 with jeffh/intfloat-multilingual-e5-small:q8_0 (BERT/XLM-R, GGUF Q8_0). Not asking for action — just leaving a record in case someone else lands here from a search.
Symptom
POST /api/embed (and /api/embeddings) crashes the runner with the GGML_ASSERT(i01 >= 0 && i01 < ne01) failure. The runner is started with --ollama-engine (new engine path).
It's input-dependent, not "Japanese breaks it"
I wrote a small repro script (gist) that runs 17 single-string inputs against /api/embed. Same model, same endpoint, no batching. So it's not "multibyte input", input length, or leading whitespace. Single-character katakana like ス and テ work fine, but the 2-character テス and カナ crash. Single kanji and hiragana of any length I tested work. Inputs containing 。 or 、 always crash.
Best guess: SentencePiece is producing certain merged-token IDs (e.g. ▁テス, ▁カナ, ▁。) that fall outside the valid embedding-table index range under the new engine, while single-character tokens or other merge paths stay within range. Notably, ollama run "テスト文章です。" works fine with the same model — only /api/embed crashes — which suggests the issue is in the embedding-specific code path rather than the tokenizer in general.
The same script against bge-m3 on the same Ollama instance passes cleanly, so the issue is specific to this particular model + new engine combination, not the embedding code path as a whole.
Things that did NOT help
- num_ctx: 512 via Modelfile (the workaround that helped in #8431)
- truncate: false, keep_alive: 0, options.num_ctx: 512 in the request body
- the OLLAMA_NEW_ENGINE=false env var (no longer respected in v0.20.2 — the runner still launches with --ollama-engine)
- the /api/embeddings endpoint instead of /api/embed
What did help
Switching to bge-m3 (official Ollama library). Same endpoint, same Ollama version, same hardware, same exact inputs — clean embeddings every time, including batched requests.
Environment
- Ollama v0.20.2, --ollama-engine runner
- jeffh/intfloat-multilingual-e5-small:q8_0 (architecture: bert, n_ctx_train: 512, embedding length: 384, Q8_0)
- bge-m3 (for comparison)
Leaving this on #7288 since it looks like part of the same broader cluster of /api/embed + GGML_ASSERT(i01 >= 0 && i01 < ne01) regressions reported here over multiple versions. Workaround for anyone hitting this from a search: try a different embedding model (e.g. bge-m3) before deeper debugging.
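The gist itself is not reproduced here, but a probe of the same shape (a sketch; the inputs and model name are taken from the comment above) would look like:

```python
# Sketch: send each input as its own /api/embed request and report which ones fail.
import requests

MODEL = "jeffh/intfloat-multilingual-e5-small:q8_0"
INPUTS = ["ス", "テ", "テス", "カナ", "テスト文章です。"]

for text in INPUTS:
    try:
        r = requests.post(
            "http://localhost:11434/api/embed",
            json={"model": MODEL, "input": text},
            timeout=60,
        )
        status = f"HTTP {r.status_code}"
    except requests.RequestException as exc:
        status = f"connection error: {exc}"  # a runner crash surfaces as a dropped connection
    print(f"{text!r}: {status}")
```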
@joaquinariasco-lab commented on GitHub (Apr 17, 2026):
Can you share a minimal reproducible example including the exact "multiple fragments" payload you send (array vs concatenated string), the approximate token/character length of each fragment, the model + version used, and your truncate / context length settings, so we can determine whether the GGML_ASSERT failure is triggered by batching behavior or by exceeding the model's embedding window?
@ynott commented on GitHub (Apr 23, 2026):
@joaquinariasco-lab Thank you for the follow-up.
Environment:
Below is the simplest command to reproduce the issue:
This fails on my side.
For comparison, this succeeds:
I also tested a multiple-fragments payload (failed):
I also tested the same inputs with the Ollama CLI:
Observed results:
Approximate lengths:
This does reproduce with a multiple-fragment payload, but also with very short single-string inputs. For this reason, this reproduction does not appear to be specific to batching, nor does it appear to be due to exceeding the embedding window.
I have uploaded the debug log here:
https://gist.github.com/ynott/8a9ad624e8aae8dfcb8221a343e4030b
Relevant details from the failing "テス" case: