[GH-ISSUE #6094] "embedding generation failed: do embedding request: Post \"http://127.0.0.1:33967/embedding\": EOF" #29571

Closed
opened 2026-04-22 08:33:29 -05:00 by GiteaMirror · 33 comments
Owner

Originally created by @yeexiangzhen1001 on GitHub (Jul 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6094

What is the issue?

2024/07/31 09:18:15 routes.go:1099: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-31T09:18:16.095Z level=INFO source=images.go:786 msg="total blobs: 2"
time=2024-07-31T09:18:16.095Z level=INFO source=images.go:793 msg="total unused blobs removed: 0"
time=2024-07-31T09:18:16.095Z level=INFO source=routes.go:1146 msg="Listening on [::]:11434 (version 0.3.1)"
time=2024-07-31T09:18:16.095Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama37639419/runners
time=2024-07-31T09:18:18.739Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [rocm_v60102 cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-07-31T09:18:18.739Z level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-31T09:18:18.808Z level=INFO source=types.go:105 msg="inference compute" id=GPU-31fa3c8c-f42e-bade-72ec-f936eb48ac45 library=cuda compute=8.6 driver=12.2 name="NVIDIA GeForce RTX 3090 Ti" total="23.7 GiB" available="17.2 GiB"
time=2024-07-31T09:20:14.214Z level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527 gpu=GPU-31fa3c8c-f42e-bade-72ec-f936eb48ac45 parallel=4 available=18469158912 required="737.9 MiB"
time=2024-07-31T09:20:14.214Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[17.2 GiB]" memory.required.full="737.9 MiB" memory.required.partial="737.9 MiB" memory.required.kv="24.0 MiB" memory.required.allocations="[737.9 MiB]" memory.weights.total="186.5 MiB" memory.weights.repeating="155.5 MiB" memory.weights.nonrepeating="30.9 MiB" memory.graph.full="48.0 MiB" memory.graph.partial="48.0 MiB"
time=2024-07-31T09:20:14.214Z level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama37639419/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --parallel 4 --port 44985"
time=2024-07-31T09:20:14.214Z level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-31T09:20:14.214Z level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
time=2024-07-31T09:20:14.214Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="6eeaeba" tid="127422522179584" timestamp=1722417614
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="127422522179584" timestamp=1722417614 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="44985" tid="127422522179584" timestamp=1722417614
llama_model_loader: loaded meta data with 22 key-value pairs and 197 tensors from /root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = Dmeta-embedding-zh
llama_model_loader: - kv 2: bert.block_count u32 = 12
llama_model_loader: - kv 3: bert.context_length u32 = 1024
llama_model_loader: - kv 4: bert.embedding_length u32 = 768
llama_model_loader: - kv 5: bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 1
llama_model_loader: - kv 9: bert.attention.causal bool = false
llama_model_loader: - kv 10: bert.pooling_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = bert
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,21128] = ["[PAD]", "[unused1]", "[unused2]", "...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,21128] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 16: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 18: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 19: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - type f32: 123 tensors
llama_model_loader: - type f16: 74 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.0769 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 21128
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 1024
llm_load_print_meta: n_embd = 768
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 768
llm_load_print_meta: n_embd_v_gqa = 768
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 3072
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 2
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 1024
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 109M
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 102.07 M
llm_load_print_meta: model size = 194.92 MiB (16.02 BPW)
llm_load_print_meta: general.name = Dmeta-embedding-zh
llm_load_print_meta: BOS token = 0 '[PAD]'
llm_load_print_meta: EOS token = 2 '[unused2]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.16 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors: CPU buffer size = 32.46 MiB
llm_load_tensors: CUDA0 buffer size = 162.46 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 288.00 MiB
llama_new_context_with_model: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 19.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 4.00 MiB
llama_new_context_with_model: graph nodes = 429
llama_new_context_with_model: graph splits = 2
time=2024-07-31T09:20:14.465Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server loading model"
INFO [main] model loaded | tid="127422522179584" timestamp=1722417614
time=2024-07-31T09:20:14.966Z level=INFO source=server.go:623 msg="llama runner started in 0.75 seconds"
[GIN] 2024/07/31 - 09:20:15 | 200 | 862.184786ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:20:15 | 200 | 91.260258ms | 10.234.218.0 | POST "/api/embeddings"
time=2024-07-31T09:20:15.383Z level=INFO source=routes.go:426 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:44985/embedding\": EOF"
[GIN] 2024/07/31 - 09:20:15 | 500 | 140.114654ms | 10.234.218.0 | POST "/api/embeddings"
time=2024-07-31T09:23:45.923Z level=WARN source=server.go:503 msg="llama runner process no longer running" sys=139 string="signal: segmentation fault (core dumped)"
time=2024-07-31T09:23:50.993Z level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.069197565 model=/root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527
time=2024-07-31T09:23:51.075Z level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527 gpu=GPU-31fa3c8c-f42e-bade-72ec-f936eb48ac45 parallel=4 available=18469158912 required="737.9 MiB"
time=2024-07-31T09:23:51.075Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[17.2 GiB]" memory.required.full="737.9 MiB" memory.required.partial="737.9 MiB" memory.required.kv="24.0 MiB" memory.required.allocations="[737.9 MiB]" memory.weights.total="186.5 MiB" memory.weights.repeating="155.5 MiB" memory.weights.nonrepeating="30.9 MiB" memory.graph.full="48.0 MiB" memory.graph.partial="48.0 MiB"
time=2024-07-31T09:23:51.075Z level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama37639419/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --parallel 4 --port 42155"
time=2024-07-31T09:23:51.075Z level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-31T09:23:51.075Z level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
time=2024-07-31T09:23:51.076Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="6eeaeba" tid="131709034942464" timestamp=1722417831
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="131709034942464" timestamp=1722417831 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="42155" tid="131709034942464" timestamp=1722417831
llama_model_loader: loaded meta data with 22 key-value pairs and 197 tensors from /root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = Dmeta-embedding-zh
llama_model_loader: - kv 2: bert.block_count u32 = 12
llama_model_loader: - kv 3: bert.context_length u32 = 1024
llama_model_loader: - kv 4: bert.embedding_length u32 = 768
llama_model_loader: - kv 5: bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 1
llama_model_loader: - kv 9: bert.attention.causal bool = false
llama_model_loader: - kv 10: bert.pooling_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = bert
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,21128] = ["[PAD]", "[unused1]", "[unused2]", "...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,21128] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 16: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 18: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 19: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - type f32: 123 tensors
llama_model_loader: - type f16: 74 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.0769 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 21128
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 1024
llm_load_print_meta: n_embd = 768
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 768
llm_load_print_meta: n_embd_v_gqa = 768
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 3072
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 2
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 1024
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 109M
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 102.07 M
llm_load_print_meta: model size = 194.92 MiB (16.02 BPW)
llm_load_print_meta: general.name = Dmeta-embedding-zh
llm_load_print_meta: BOS token = 0 '[PAD]'
llm_load_print_meta: EOS token = 2 '[unused2]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.16 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors: CPU buffer size = 32.46 MiB
llm_load_tensors: CUDA0 buffer size = 162.46 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 288.00 MiB
llama_new_context_with_model: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 19.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 4.00 MiB
llama_new_context_with_model: graph nodes = 429
llama_new_context_with_model: graph splits = 2
time=2024-07-31T09:23:51.243Z level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.319657234 model=/root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527
time=2024-07-31T09:23:51.327Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server loading model"
INFO [main] model loaded | tid="131709034942464" timestamp=1722417831
time=2024-07-31T09:23:51.829Z level=INFO source=server.go:623 msg="llama runner started in 0.75 seconds"
[GIN] 2024/07/31 - 09:23:51 | 200 | 5.954027368s | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:51 | 200 | 5.997875851s | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:51 | 200 | 6.001301156s | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:51 | 200 | 6.05401596s | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 6.093406397s | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 6.093515843s | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 141.106871ms | 10.234.218.0 | POST "/api/embeddings"
INFO [update_slots] input truncated | n_ctx=2048 n_erase=1989 n_keep=0 n_left=2048 n_shift=1024 tid="131709034942464" timestamp=1722417832
[GIN] 2024/07/31 - 09:23:52 | 200 | 156.396038ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 159.160468ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 155.371305ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 150.237024ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 161.78585ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 158.374292ms | 10.234.218.0 | POST "/api/embeddings"
INFO [update_slots] input truncated | n_ctx=2048 n_erase=1517 n_keep=0 n_left=2048 n_shift=1024 tid="131709034942464" timestamp=1722417832
[GIN] 2024/07/31 - 09:23:52 | 200 | 144.427285ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 192.549717ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 131.371235ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 185.844931ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 151.950066ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 141.888776ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 171.173954ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 130.251712ms | 10.234.218.0 | POST "/api/embeddings"
INFO [update_slots] input truncated | n_ctx=2048 n_erase=1709 n_keep=0 n_left=2048 n_shift=1024 tid="131709034942464" timestamp=1722417832
[GIN] 2024/07/31 - 09:23:52 | 200 | 140.112505ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 171.12123ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 227.184409ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 264.346952ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 189.302007ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 183.643992ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 165.703255ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 200 | 229.741451ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:52 | 500 | 303.282026ms | 10.234.218.0 | POST "/api/embeddings"
time=2024-07-31T09:23:52.825Z level=INFO source=routes.go:426 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:42155/embedding\": EOF"
time=2024-07-31T09:23:57.889Z level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.063724982 model=/root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527
time=2024-07-31T09:23:57.975Z level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527 gpu=GPU-31fa3c8c-f42e-bade-72ec-f936eb48ac45 parallel=4 available=18469158912 required="737.9 MiB"
time=2024-07-31T09:23:57.975Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[17.2 GiB]" memory.required.full="737.9 MiB" memory.required.partial="737.9 MiB" memory.required.kv="24.0 MiB" memory.required.allocations="[737.9 MiB]" memory.weights.total="186.5 MiB" memory.weights.repeating="155.5 MiB" memory.weights.nonrepeating="30.9 MiB" memory.graph.full="48.0 MiB" memory.graph.partial="48.0 MiB"
time=2024-07-31T09:23:57.975Z level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama37639419/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --parallel 4 --port 33967"
time=2024-07-31T09:23:57.976Z level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-31T09:23:57.976Z level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
time=2024-07-31T09:23:57.976Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="6eeaeba" tid="125558191894528" timestamp=1722417837
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="125558191894528" timestamp=1722417837 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="33967" tid="125558191894528" timestamp=1722417837
llama_model_loader: loaded meta data with 22 key-value pairs and 197 tensors from /root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = Dmeta-embedding-zh
llama_model_loader: - kv 2: bert.block_count u32 = 12
llama_model_loader: - kv 3: bert.context_length u32 = 1024
llama_model_loader: - kv 4: bert.embedding_length u32 = 768
llama_model_loader: - kv 5: bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 1
llama_model_loader: - kv 9: bert.attention.causal bool = false
llama_model_loader: - kv 10: bert.pooling_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = bert
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,21128] = ["[PAD]", "[unused1]", "[unused2]", "...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,21128] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 16: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 18: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 19: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - type f32: 123 tensors
llama_model_loader: - type f16: 74 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.0769 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 21128
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 1024
llm_load_print_meta: n_embd = 768
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 768
llm_load_print_meta: n_embd_v_gqa = 768
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 3072
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 2
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 1024
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 109M
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 102.07 M
llm_load_print_meta: model size = 194.92 MiB (16.02 BPW)
llm_load_print_meta: general.name = Dmeta-embedding-zh
llm_load_print_meta: BOS token = 0 '[PAD]'
llm_load_print_meta: EOS token = 2 '[unused2]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.16 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors: CPU buffer size = 32.46 MiB
llm_load_tensors: CUDA0 buffer size = 162.46 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 288.00 MiB
llama_new_context_with_model: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 19.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 4.00 MiB
llama_new_context_with_model: graph nodes = 429
llama_new_context_with_model: graph splits = 2
time=2024-07-31T09:23:58.139Z level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.312995606 model=/root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527
time=2024-07-31T09:23:58.226Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server loading model"
INFO [main] model loaded | tid="125558191894528" timestamp=1722417838
time=2024-07-31T09:23:58.729Z level=INFO source=server.go:623 msg="llama runner started in 0.75 seconds"
[GIN] 2024/07/31 - 09:23:58 | 200 | 6.175518609s | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:58 | 200 | 6.173129645s | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:58 | 200 | 6.181901759s | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:58 | 200 | 6.217999442s | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:58 | 200 | 6.128390115s | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:58 | 200 | 139.275881ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:58 | 200 | 141.805964ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:58 | 200 | 147.553231ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 147.626781ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 90.649859ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 134.183906ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 100.703301ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 76.093064ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 139.579148ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 195.963998ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 184.951077ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 204.863879ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 93.607337ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 92.691741ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 122.460956ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:23:59 | 200 | 164.876363ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:26:50 | 200 | 93.430143ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:26:50 | 200 | 51.56662ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:26:50 | 200 | 139.845262ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:26:50 | 200 | 48.229681ms | 10.234.218.0 | POST "/api/embeddings"
INFO [update_slots] input truncated | n_ctx=2048 n_erase=1522 n_keep=0 n_left=2048 n_shift=1024 tid="125558191894528" timestamp=1722418010
[GIN] 2024/07/31 - 09:26:50 | 200 | 103.527766ms | 10.234.218.0 | POST "/api/embeddings"
[GIN] 2024/07/31 - 09:26:50 | 500 | 138.709641ms | 10.234.218.0 | POST "/api/embeddings"
time=2024-07-31T09:26:50.849Z level=INFO source=routes.go:426 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:33967/embedding\": EOF"
[GIN] 2024/07/31 - 09:37:35 | 200 | 19.4µs | 127.0.0.1 | GET "/api/version"
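
For reference, a minimal client sketch of the kind of request that hits this error is shown below. It assumes the imported GGUF is tagged `dmeta-embedding-zh` and that Ollama listens on its default port; both are placeholders, not taken from the log. Only the `/api/embeddings` endpoint and the `model`/`prompt` payload fields are part of the documented Ollama API.

```python
# Hypothetical reproduction sketch. MODEL_TAG and OLLAMA_HOST are assumptions;
# only the /api/embeddings endpoint and payload shape come from the Ollama API.
import json
import urllib.request

OLLAMA_HOST = "http://127.0.0.1:11434"   # assumed default host/port
MODEL_TAG = "dmeta-embedding-zh"         # placeholder tag for the imported GGUF

def embed(text: str) -> list[float]:
    """POST a single prompt to /api/embeddings and return the embedding vector."""
    payload = json.dumps({"model": MODEL_TAG, "prompt": text}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # urlopen raises HTTPError on the 500 responses seen in the log above.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

if __name__ == "__main__":
    # The failures appear alongside "input truncated" warnings for long inputs,
    # so exercising both short and long prompts may help reproduce the EOF.
    print(len(embed("hello world")))
    print(len(embed("很长的文本 " * 2000)))
```

The EOF appears to surface on whichever request is in flight when the runner process dies (see the segmentation fault logged at 09:23:45 above), which is why the same batch of requests returns a mix of 200s and a single 500.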

OS

Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.3.1

KV buffer size = 288.00 MiB llama_new_context_with_model: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB llama_new_context_with_model: CPU output buffer size = 0.00 MiB llama_new_context_with_model: CUDA0 compute buffer size = 19.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 4.00 MiB llama_new_context_with_model: graph nodes = 429 llama_new_context_with_model: graph splits = 2 time=2024-07-31T09:23:58.139Z level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.312995606 model=/root/.ollama/models/blobs/sha256-9b18b416fe232d5a834e15ce0d6cc353d7f6366423b8a7ef236db9ecee320527 time=2024-07-31T09:23:58.226Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server loading model" INFO [main] model loaded | tid="125558191894528" timestamp=1722417838 time=2024-07-31T09:23:58.729Z level=INFO source=server.go:623 msg="llama runner started in 0.75 seconds" [GIN] 2024/07/31 - 09:23:58 | 200 | 6.175518609s | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:58 | 200 | 6.173129645s | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/ 31 - 09:23:58 | 200 | 6.181901759s | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:58 | 200 | 6.217999442s | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:58 | 200 | 6.128390115s | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:58 | 200 | 139.275881ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:58 | 200 | 141.805964ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:58 | 200 | 147.553231ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 147.626781ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 90.649859ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 134.183906ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 100.703301ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 76.093064ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 139.579148ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 195.963998ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 184.951077ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 204.863879ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 93.607337ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 92.691741ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 122.460956ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:23:59 | 200 | 164.876363ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:26:50 | 200 | 93.430143ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:26:50 | 200 | 51.56662ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:26: 50 | 200 | 139.845262ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:26:50 | 200 | 48.229681ms | 10.234.218.0 | POST "/api/embeddings" INFO [update_slots] input truncated | n_ctx=2048 n_erase=1522 n_keep=0 n_left=2048 n_shift=1024 tid="125558191894528" timestamp=1722418010 [GIN] 2024/07/31 - 09:26:50 | 200 | 103.527766ms | 10.234.218.0 | POST "/api/embeddings" [GIN] 2024/07/31 - 09:26:50 | 500 | 138.709641ms | 10.234.218.0 | POST "/api/embeddings" time=2024-07-31T09:26:50.849Z level=INFO 
source=routes.go:426 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:33967/embedding\": EOF"
[GIN] 2024/07/31 - 09:37:35 | 200 | 19.4µs | 127.0.0.1 | GET "/api/version"

### OS

Docker

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.3.1
GiteaMirror added the bug label 2026-04-22 08:33:29 -05:00
Author
Owner

@royjhan commented on GitHub (Jul 31, 2024):

How did you produce this error? Do you get something similar when hitting api/embed?
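(A minimal sketch for comparing the two endpoints, for anyone who wants to try this: it assumes a local Ollama at 127.0.0.1:11434 and the `requests` package; the model tag and text below are placeholders, not taken from this report.)

```python
# Send the same text to the legacy /api/embeddings and the newer /api/embed
# endpoint and compare status codes. Model tag and text are placeholders.
import requests

BASE = "http://127.0.0.1:11434"
MODEL = "nomic-embed-text"  # placeholder; substitute the model that fails for you
TEXT = "a chunk of text that previously triggered the EOF"

legacy = requests.post(f"{BASE}/api/embeddings", json={"model": MODEL, "prompt": TEXT})
print("/api/embeddings:", legacy.status_code)

newer = requests.post(f"{BASE}/api/embed", json={"model": MODEL, "input": TEXT})
print("/api/embed:", newer.status_code)
```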

Author
Owner

@lyh007 commented on GitHub (Aug 7, 2024):

I have the same problem.

Author
Owner

@FellowTraveler commented on GitHub (Aug 12, 2024):

@yeexiangzhen1001 @lyh007 Can you provide more details about this issue? Were you running multiple concurrent embeddings? Or only a single one? Do they all fail even running 1-by-1?

Author
Owner

@r0x07k commented on GitHub (Aug 14, 2024):

@FellowTraveler I believe I have the same problem. It occurs when OLLAMA_NUM_PARALLEL > 1.

If I set OLLAMA_NUM_PARALLEL=2 and run concurrent embedding generation, it works for some time. By observing the GPU load and process performance, I can confirm that Ollama runs concurrently. However, at some random point, it fails and restarts.

The issue never occurs with OLLAMA_NUM_PARALLEL=1 on the same dataset.

I am using WSL.
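For reference, a rough repro sketch of the kind of concurrent load described above (not the exact workload from this report): it assumes a local Ollama at 127.0.0.1:11434, the `requests` package, and a placeholder model tag.

```python
# Fire many embedding requests concurrently and count how many come back non-200.
# With OLLAMA_NUM_PARALLEL > 1 on the server, this is the pattern that eventually
# hits the 500 / EOF seen in the logs below. Model tag and texts are placeholders.
import requests
from concurrent.futures import ThreadPoolExecutor

BASE = "http://127.0.0.1:11434"
MODEL = "nomic-embed-text"  # placeholder model tag

texts = [f"document chunk {i} " * 50 for i in range(200)]

def embed(text: str) -> int:
    r = requests.post(
        f"{BASE}/api/embeddings",
        json={"model": MODEL, "prompt": text},
        timeout=120,
    )
    return r.status_code

with ThreadPoolExecutor(max_workers=4) as pool:  # more workers than OLLAMA_NUM_PARALLEL
    codes = list(pool.map(embed, texts))

print({code: codes.count(code) for code in set(codes)})
```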

Logs:

Aug 13 20:43:39 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:39 | 200 |  1.128451373s |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:39 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:39 | 200 |  1.066611096s |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:39 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:39 | 200 |  1.087745793s |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:39 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:39 | 200 |  1.029340586s |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:39 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:39 | 200 |   1.11723428s |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:39 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:39 | 200 |  941.225895ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:39 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:39 | 200 |  989.152974ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:39 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:39 | 200 |  969.737569ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:39 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:39 | 200 |  1.039815022s |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:39 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:39 | 200 |  1.018910443s |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:39 HOST ollama[203536]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/llama.cpp:12254: seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == MEAN"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.051-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:34119/embedding\": EOF"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  1.025392374s |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": read tcp 127.0.0.1:55594->127.0.0.1:34119: read: connection reset by peer"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  1.000137616s |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  959.997572ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  920.372552ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  840.913782ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  928.328085ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  775.353759ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  769.353247ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  699.744016ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  620.432543ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  560.559146ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  540.472535ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  541.755321ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  530.773609ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  420.587612ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  400.452235ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  380.769148ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  360.403793ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  230.564378ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.085-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |  199.634708ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.086-04:00 level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:34119/health\": dial tcp 127.0.0.1:34119: connect: connection refused"
Aug 13 20:43:40 HOST ollama[203536]: [GIN] 2024/08/13 - 20:43:40 | 500 |   140.82641ms |       127.0.0.1 | POST     "/api/embeddings"
Aug 13 20:43:40 HOST ollama[203536]: time=2024-08-13T20:43:40.129-04:00 level=WARN source=server.go:475 msg="llama runner process no longer running" sys=6 string="signal: aborted"
Aug 13 20:43:45 HOST ollama[203536]: time=2024-08-13T20:43:45.219-04:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.090310008 model=/usr/share/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
Aug 13 20:43:45 HOST ollama[203536]: time=2024-08-13T20:43:45.394-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[11.0 GiB]" memory.required.full="883.9 MiB" memory.required.partial="883.9 MiB" memory.required.kv="48.0 MiB" memory.required.allocations="[883.9 MiB]" memory.weights.total="264.1 MiB" memory.weights.repeating="219.4 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="96.0 MiB" memory.graph.partial="96.0 MiB"
Aug 13 20:43:45 HOST ollama[203536]: time=2024-08-13T20:43:45.394-04:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama3873679269/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 16384 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --flash-attn --parallel 2 --port 35347"
Author
Owner

@r0x07k commented on GitHub (Aug 14, 2024):

If I set OLLAMA_NUM_PARALLEL to 3 or higher, the issue occurs more quickly.

Author
Owner

@CalebFenton commented on GitHub (Aug 17, 2024):

I'm also getting this error. I'm running them 1 by 1 using the OpenAI client. Any suggestions for troubleshooting would be appreciated. I'm next going to try with debug logs (for more info) and with really long strings (to try to reproduce).
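(For context, a minimal sketch of that "1 by 1" path via Ollama's OpenAI-compatible endpoint, assuming the `openai` Python package (v1+) and a local server; the model tag is a placeholder, not the one from this comment.)

```python
# One embedding request at a time through the OpenAI-compatible /v1 endpoint.
# base_url points at a local Ollama; the api_key just needs to be non-empty.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="ollama")

resp = client.embeddings.create(
    model="nomic-embed-text",  # placeholder; use the model that fails for you
    input="a single long string, sent on its own",
)
print(len(resp.data[0].embedding))
```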

Author
Owner

@AndreasKarasenko commented on GitHub (Sep 2, 2024):

Similar issue for me, although I'm also using OLLAMA_MAX_LOADED_MODELS=2 alongside OLLAMA_NUM_PARALLEL=2.
However, the error does not appear for every dataset I use.

Author
Owner

@jmorganca commented on GitHub (Sep 2, 2024):

Hi folks, this should be fixed in the latest versions of Ollama (0.3.7+). Let me know if you still encounter the issue.

Author
Owner

@marcochang1028 commented on GitHub (Sep 27, 2024):

I just pulled the 0.3.12 image, but I still encounter this issue.
[GIN] 2024/09/27 - 23:18:24 | 500 | 930.420752ms | 172.18.0.1 | POST "/api/embeddings"
time=2024-09-27T23:18:24.287+08:00 level=INFO source=routes.go:478 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:34061/embedding\": EOF"

Author
Owner

@PierreMesure commented on GitHub (Oct 2, 2024):

Getting the problem with Ollama 0.3.12, OLLAMA_NUM_PARALLEL=1 and OLLAMA_MAX_LOADED_MODELS=1 (or with higher values).

ollama  | [GIN] 2024/10/02 - 16:32:38 | 200 |  719.013018ms |    172.21.0.150 | POST     "/api/embeddings"
ollama  | [GIN] 2024/10/02 - 16:32:38 | 200 |   60.250597ms |    172.21.0.150 | POST     "/api/embeddings"
ollama  | [GIN] 2024/10/02 - 16:32:38 | 200 |   102.17148ms |    172.21.0.150 | POST     "/api/embeddings"
ollama  | [GIN] 2024/10/02 - 16:32:38 | 200 |  100.411626ms |    172.21.0.150 | POST     "/api/embeddings"
ollama  | time=2024-10-02T16:32:38.833Z level=INFO source=routes.go:478 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:39439/embedding\": EOF"
ollama  | [GIN] 2024/10/02 - 16:32:38 | 500 |  252.662178ms |    172.21.0.150 | POST     "/api/embeddings"

EDIT: I just tried with two other embedding models (nomic-embed-text and jeffh/intfloat-multilingual-e5-large:f32) and they work flawlessly, so the model was the culprit in the first place. It's a GGUF version of [KBLab/sentence-bert-swedish-cased](https://huggingface.co/KBLab/sentence-bert-swedish-cased) that [I made myself](https://huggingface.co/PierreMesure/sentence-bert-swedish-cased-gguf), so I'd be very thankful if someone could help me understand why it doesn't work. 😥

EDIT: We've now started getting the problem with "jeffh/intfloat-multilingual-e5-large:f32". 😭
@jmorganca, I think it would be good to reopen this issue. I'm happy to provide more details to recreate the problem.

Author
Owner

@FellowTraveler commented on GitHub (Oct 19, 2024):

If the model works sometimes, but fails occasionally while concurrent, then the software is the problem, not the model itself. Remember, the model is only a data file. Even if the model was CORRUPTED, the software should still handle that situation gracefully. And just because you see a symptom with one data file, but not another data file, doesn't mean the software itself is bug-free. Sometimes the same bug will express itself differently with different data files.

Author
Owner

@PierreMesure commented on GitHub (Oct 30, 2024):

I just did some tests with LlamaIndex and jeffh/intfloat-multilingual-e5-large-instruct:f32 against different Ollama versions; here are the results:

  • ✅ 0.3.6
  • ✅ 0.3.7
  • ✅ 0.3.10
  • ✅ 0.3.13
  • ❌ 0.3.14
  • ❌ 0.4.0-rc5

Could it be linked to the introduction of a new Go subprocess model runner advertised in the release notes since 0.3.13? @jmorganca
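(For anyone repeating this sweep, a sketch of the per-version check: it assumes the instance under test is already running at 127.0.0.1:11434 and that the corpus below is a stand-in for the real dataset.)

```python
# Run the same corpus against whichever Ollama version is currently serving and
# report whether any /api/embeddings call returns a non-200. Corpus is a placeholder.
import requests

BASE = "http://127.0.0.1:11434"
MODEL = "jeffh/intfloat-multilingual-e5-large-instruct:f32"
texts = [f"test sentence number {i}" for i in range(100)]  # placeholder corpus

version = requests.get(f"{BASE}/api/version").json().get("version", "unknown")

failures = 0
for text in texts:
    r = requests.post(f"{BASE}/api/embeddings", json={"model": MODEL, "prompt": text}, timeout=120)
    failures += r.status_code != 200

print(f"{version}: {'FAIL' if failures else 'PASS'} ({failures} of {len(texts)} requests failed)")
```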

Author
Owner

@jessegross commented on GitHub (Oct 30, 2024):

@PierreMesure You previously said that the issue is happening on 0.3.12 but then said that it works before 0.3.13? Is that just with a different model? As @FellowTraveler said, it's probably the same thing but could show up differently with different models.

The Go runner isn't used until 0.4.0, so that's unlikely to be the issue.

Can you post the full logs with OLLAMA_DEBUG set from when you see the problem? It's probably a different issue from when this was originally reported.

Author
Owner

@PierreMesure commented on GitHub (Oct 31, 2024):

That's true! Very weird indeed!

Here's a stacktrace with 0.3.14 and jeffh/intfloat-multilingual-e5-large-instruct:f32. It fails at the first embedding query:

2024/10/31 07:55:11 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:8 OLLAMA_ORIGINS:[chrome-extension://* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
2024-10-31T07:55:11.704Z level=INFO source=images.go:754 msg="total blobs: 108"
2024-10-31T07:55:11.706Z level=INFO source=images.go:761 msg="total unused blobs removed: 0"
2024-10-31T07:55:11.706Z level=INFO source=routes.go:1205 msg="Listening on [::]:11434 (version 0.3.14)"
2024-10-31T07:55:11.707Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
2024-10-31T07:55:11.707Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
2024-10-31T07:55:11.707Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
2024-10-31T07:55:11.707Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
2024-10-31T07:55:11.707Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
2024-10-31T07:55:11.707Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11 cuda_v12 cpu]"
2024-10-31T07:55:11.707Z level=DEBUG source=common.go:50 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
2024-10-31T07:55:11.707Z level=DEBUG source=sched.go:105 msg="starting llm scheduler"
2024-10-31T07:55:11.707Z level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
2024-10-31T07:55:11.708Z level=DEBUG source=gpu.go:94 msg="searching for GPU discovery libraries for NVIDIA"
2024-10-31T07:55:11.708Z level=DEBUG source=gpu.go:505 msg="Searching for GPU library" name=libcuda.so*
2024-10-31T07:55:11.708Z level=DEBUG source=gpu.go:528 msg="gpu library search" globs="[/usr/lib/ollama/libcuda.so* /usr/local/nvidia/lib/libcuda.so* /usr/local/nvidia/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
2024-10-31T07:55:11.709Z level=DEBUG source=gpu.go:562 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.535.183.01]
CUDA driver version: 12.2
2024-10-31T07:55:11.712Z level=DEBUG source=gpu.go:129 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.535.183.01
[GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4] CUDA totalMem 48669 mb
[GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4] CUDA freeMem 48117 mb
[GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4] Compute Capability 8.6
2024-10-31T07:55:11.798Z level=DEBUG source=amd_linux.go:416 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
2024-10-31T07:55:11.798Z level=INFO source=types.go:123 msg="inference compute" id=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4 library=cuda variant=v12 compute=8.6 driver=12.2 name="NVIDIA RTX A6000" total="47.5 GiB" available="47.0 GiB"
2024-10-31T07:55:31.063Z level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="125.5 GiB" before.free="92.7 GiB" before.free_swap="24.6 GiB" now.total="125.5 GiB" now.free="92.4 GiB" now.free_swap="24.6 GiB"
CUDA driver version: 12.2
2024-10-31T07:55:31.150Z level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4 name="NVIDIA RTX A6000" overhead="0 B" before.total="47.5 GiB" before.free="47.0 GiB" now.total="47.5 GiB" now.free="47.0 GiB" now.used="551.9 MiB"
releasing cuda driver library
2024-10-31T07:55:31.236Z level=DEBUG source=sched.go:224 msg="loading first model" model=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4
2024-10-31T07:55:31.236Z level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[47.0 GiB]"
2024-10-31T07:55:31.236Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 gpu=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4 parallel=1 available=50454462464 required="2.6 GiB"
2024-10-31T07:55:31.236Z level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="125.5 GiB" before.free="92.4 GiB" before.free_swap="24.6 GiB" now.total="125.5 GiB" now.free="92.4 GiB" now.free_swap="24.6 GiB"
CUDA driver version: 12.2
2024-10-31T07:55:31.306Z level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4 name="NVIDIA RTX A6000" overhead="0 B" before.total="47.5 GiB" before.free="47.0 GiB" now.total="47.5 GiB" now.free="47.0 GiB" now.used="551.9 MiB"
releasing cuda driver library
2024-10-31T07:55:31.306Z level=INFO source=server.go:105 msg="system memory" total="125.5 GiB" free="92.4 GiB" free_swap="24.6 GiB"
2024-10-31T07:55:31.306Z level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[47.0 GiB]"
2024-10-31T07:55:31.307Z level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=25 layers.offload=25 layers.split="" memory.available="[47.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.6 GiB" memory.required.partial="2.6 GiB" memory.required.kv="12.0 MiB" memory.required.allocations="[2.6 GiB]" memory.weights.total="1.1 GiB" memory.weights.repeating="188.6 MiB" memory.weights.nonrepeating="976.6 MiB" memory.graph.full="32.0 MiB" memory.graph.partial="32.0 MiB"
2024-10-31T07:55:31.307Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
2024-10-31T07:55:31.307Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
2024-10-31T07:55:31.307Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
2024-10-31T07:55:31.307Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
2024-10-31T07:55:31.307Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
2024-10-31T07:55:31.307Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
2024-10-31T07:55:31.307Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
2024-10-31T07:55:31.307Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
2024-10-31T07:55:31.307Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
2024-10-31T07:55:31.307Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
2024-10-31T07:55:31.321Z level=INFO source=server.go:388 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 --ctx-size 2048 --batch-size 512 --embedding --n-gpu-layers 25 --verbose --threads 12 --parallel 1 --port 42793"
2024-10-31T07:55:31.321Z level=DEBUG source=server.go:405 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/runners/cuda_v12:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4]"
2024-10-31T07:55:31.326Z level=INFO source=sched.go:449 msg="loaded runners" count=1
2024-10-31T07:55:31.326Z level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
2024-10-31T07:55:31.327Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
INFO [main] starting c++ runner | tid="134620879941632" timestamp=1730361331
INFO [main] build info | build=10 commit="b45ed63" tid="134620879941632" timestamp=1730361331
INFO [main] system info | n_threads=12 n_threads_batch=12 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="134620879941632" timestamp=1730361331 total_threads=24
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="23" port="42793" tid="134620879941632" timestamp=1730361331
llama_model_loader: loaded meta data with 38 key-value pairs and 389 tensors from /root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = multilingual-e5-large-instruct
llama_model_loader: - kv   3:                       general.organization str              = Tmp
llama_model_loader: - kv   4:                           general.finetune str              = instruct
llama_model_loader: - kv   5:                           general.basename str              = intfloat-multilingual-e5
llama_model_loader: - kv   6:                         general.size_label str              = large
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                               general.tags arr[str,3]       = ["mteb", "sentence-transformers", "tr...
llama_model_loader: - kv   9:                          general.languages arr[str,94]      = ["multilingual", "af", "am", "ar", "a...
llama_model_loader: - kv  10:                           bert.block_count u32              = 24
llama_model_loader: - kv  11:                        bert.context_length u32              = 512
llama_model_loader: - kv  12:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv  13:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv  14:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv  15:          bert.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                          general.file_type u32              = 0
llama_model_loader: - kv  17:                      bert.attention.causal bool             = false
llama_model_loader: - kv  18:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = t5
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,250002]  = ["<s>", "<pad>", "</s>", "<unk>", ","...
2024-10-31T07:55:31.578Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,250002]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,250002]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  25:            tokenizer.ggml.token_type_count u32              = 1
llama_model_loader: - kv  26:    tokenizer.ggml.remove_extra_whitespaces bool             = true
llama_model_loader: - kv  27:        tokenizer.ggml.precompiled_charsmap arr[u8,237539]   = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  31:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  33:                tokenizer.ggml.cls_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.mask_token_id u32              = 250001
llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  389 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 4
llm_load_vocab: token to piece cache size = 2.1668 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = UGM
llm_load_print_meta: n_vocab          = 250002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 1024
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4096
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 335M
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 558.84 M
llm_load_print_meta: model size       = 2.08 GiB (32.00 BPW) 
llm_load_print_meta: general.name     = multilingual-e5-large-instruct
llm_load_print_meta: BOS token        = 0 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: SEP token        = 2 '</s>'
llm_load_print_meta: PAD token        = 1 '<pad>'
llm_load_print_meta: CLS token        = 0 '<s>'
llm_load_print_meta: MASK token       = 250001 '[PAD250000]'
llm_load_print_meta: LF token         = 6 '▁'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.32 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:        CPU buffer size =   978.57 MiB
llm_load_tensors:      CUDA0 buffer size =  1153.23 MiB
2024-10-31T07:55:32.331Z level=DEBUG source=server.go:632 msg="model load progress 0.63"
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    26.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     6.00 MiB
llama_new_context_with_model: graph nodes  = 851
llama_new_context_with_model: graph splits = 2
DEBUG [initialize] initializing slots | n_slots=1 tid="134620879941632" timestamp=1730361332
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="134620879941632" timestamp=1730361332
INFO [main] model loaded | tid="134620879941632" timestamp=1730361332
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="134620879941632" timestamp=1730361332
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=0 tid="134620879941632" timestamp=1730361332
2024-10-31T07:55:32.583Z level=INFO source=server.go:626 msg="llama runner started in 1.26 seconds"
2024-10-31T07:55:32.584Z level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=1 tid="134620879941632" timestamp=1730361332
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=2 tid="134620879941632" timestamp=1730361332
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=2 tid="134620879941632" timestamp=1730361332
/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml.c:13371: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
2024-10-31T07:55:32.865Z level=INFO source=routes.go:478 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:42793/embedding\": EOF"
[GIN] 2024/10/31 - 07:55:32 | 500 |  1.804487083s |    172.21.0.150 | POST     "/api/embeddings"
2024-10-31T07:55:32.866Z level=DEBUG source=sched.go:466 msg="context for request finished"
2024-10-31T07:55:32.866Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 duration=5m0s
2024-10-31T07:55:32.866Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 refCount=0
2024-10-31T07:55:32.881Z level=DEBUG source=server.go:428 msg="llama runner terminated" error="signal: aborted (core dumped)"
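
For reference, the failing call is an ordinary POST to `/api/embeddings`. A minimal reproduction sketch, assuming the default port from the config above and a placeholder prompt (the text that actually triggered the assert is not shown in the log):

```sh
# Reproduction sketch (assumptions: default port 11434, placeholder prompt).
# This is the request behind the 500 logged above; the embedded text itself is unknown.
curl http://127.0.0.1:11434/api/embeddings \
  -d '{
        "model": "jeffh/intfloat-multilingual-e5-large-instruct:f32",
        "prompt": "placeholder text to embed"
      }'
```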

@PierreMesure commented on GitHub (Oct 31, 2024):

Here is another one with 0.3.13 and [sentence-bert-swedish-cased](https://huggingface.co/PierreMesure/sentence-bert-swedish-cased-gguf):

2024/10/31 08:29:32 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:8 OLLAMA_ORIGINS:[chrome-extension://* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
2024-10-31T08:29:32.116Z level=INFO source=images.go:754 msg="total blobs: 108"
2024-10-31T08:29:32.118Z level=INFO source=images.go:761 msg="total unused blobs removed: 0"
2024-10-31T08:29:32.119Z level=INFO source=routes.go:1205 msg="Listening on [::]:11434 (version 0.3.13)"
2024-10-31T08:29:32.119Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
2024-10-31T08:29:32.119Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
2024-10-31T08:29:32.119Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
2024-10-31T08:29:32.119Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
2024-10-31T08:29:32.119Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
2024-10-31T08:29:32.119Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12]"
2024-10-31T08:29:32.119Z level=DEBUG source=common.go:50 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
2024-10-31T08:29:32.119Z level=DEBUG source=sched.go:105 msg="starting llm scheduler"
2024-10-31T08:29:32.119Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
2024-10-31T08:29:32.119Z level=DEBUG source=gpu.go:86 msg="searching for GPU discovery libraries for NVIDIA"
2024-10-31T08:29:32.119Z level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
2024-10-31T08:29:32.119Z level=DEBUG source=gpu.go:491 msg="gpu library search" globs="[/usr/lib/ollama/libcuda.so* /usr/local/nvidia/lib/libcuda.so* /usr/local/nvidia/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
2024-10-31T08:29:32.166Z level=DEBUG source=gpu.go:525 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.535.183.01]
CUDA driver version: 12.2
2024-10-31T08:29:32.169Z level=DEBUG source=gpu.go:118 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.535.183.01
[GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4] CUDA totalMem 48669 mb
[GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4] CUDA freeMem 48117 mb
[GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4] Compute Capability 8.6
2024-10-31T08:29:32.260Z level=DEBUG source=amd_linux.go:376 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
2024-10-31T08:29:32.260Z level=INFO source=types.go:107 msg="inference compute" id=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4 library=cuda variant=v12 compute=8.6 driver=12.2 name="NVIDIA RTX A6000" total="47.5 GiB" available="47.0 GiB"
2024-10-31T08:29:43.861Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="125.5 GiB" before.free="91.0 GiB" before.free_swap="24.6 GiB" now.total="125.5 GiB" now.free="90.7 GiB" now.free_swap="24.6 GiB"
CUDA driver version: 12.2
2024-10-31T08:29:43.945Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4 name="NVIDIA RTX A6000" overhead="0 B" before.total="47.5 GiB" before.free="47.0 GiB" now.total="47.5 GiB" now.free="47.0 GiB" now.used="551.9 MiB"
releasing cuda driver library
2024-10-31T08:29:43.954Z level=DEBUG source=sched.go:224 msg="loading first model" model=/root/.ollama/models/blobs/sha256-9ee1505b22d4bc8d192095f924ddb62bc4783a48fbd411252310933e879930f8
2024-10-31T08:29:43.954Z level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[47.0 GiB]"
2024-10-31T08:29:43.954Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-9ee1505b22d4bc8d192095f924ddb62bc4783a48fbd411252310933e879930f8 gpu=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4 parallel=1 available=50454462464 required="974.4 MiB"
2024-10-31T08:29:43.954Z level=INFO source=server.go:108 msg="system memory" total="125.5 GiB" free="90.7 GiB" free_swap="24.6 GiB"
2024-10-31T08:29:43.954Z level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[47.0 GiB]"
2024-10-31T08:29:43.955Z level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[47.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="974.4 MiB" memory.required.partial="974.4 MiB" memory.required.kv="6.0 MiB" memory.required.allocations="[974.4 MiB]" memory.weights.total="330.5 MiB" memory.weights.repeating="183.0 MiB" memory.weights.nonrepeating="147.4 MiB" memory.graph.full="12.0 MiB" memory.graph.partial="12.0 MiB"
2024-10-31T08:29:43.955Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
2024-10-31T08:29:43.955Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
2024-10-31T08:29:43.955Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
2024-10-31T08:29:43.955Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
2024-10-31T08:29:43.955Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
2024-10-31T08:29:43.955Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
2024-10-31T08:29:43.955Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
2024-10-31T08:29:43.955Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
2024-10-31T08:29:43.955Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
2024-10-31T08:29:43.955Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
2024-10-31T08:29:43.966Z level=INFO source=server.go:399 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12/ollama_llama_server --model /root/.ollama/models/blobs/sha256-9ee1505b22d4bc8d192095f924ddb62bc4783a48fbd411252310933e879930f8 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 1 --port 38379"
2024-10-31T08:29:43.966Z level=DEBUG source=server.go:416 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/runners/cuda_v12:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4]"
2024-10-31T08:29:43.966Z level=INFO source=sched.go:449 msg="loaded runners" count=1
2024-10-31T08:29:43.966Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
2024-10-31T08:29:43.966Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
INFO [main] starting c++ runner | tid="123204351016960" timestamp=1730363384
INFO [main] build info | build=10 commit="9794cea" tid="123204351016960" timestamp=1730363384
INFO [main] system info | n_threads=12 n_threads_batch=12 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="123204351016960" timestamp=1730363384 total_threads=24
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="23" port="38379" tid="123204351016960" timestamp=1730363384
llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from /root/.ollama/models/blobs/sha256-9ee1505b22d4bc8d192095f924ddb62bc4783a48fbd411252310933e879930f8 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Sentence Bert Swedish Cased
llama_model_loader: - kv   3:                         general.size_label str              = 124M
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                               general.tags arr[str,5]       = ["sentence-transformers", "feature-ex...
llama_model_loader: - kv   6:                          general.languages arr[str,1]       = ["sv"]
llama_model_loader: - kv   7:                           bert.block_count u32              = 12
llama_model_loader: - kv   8:                        bert.context_length u32              = 512
llama_model_loader: - kv   9:                      bert.embedding_length u32              = 768
llama_model_loader: - kv  10:                   bert.feed_forward_length u32              = 3072
llama_model_loader: - kv  11:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv  12:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv  13:                          general.file_type u32              = 0
llama_model_loader: - kv  14:                      bert.attention.causal bool             = false
llama_model_loader: - kv  15:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  16:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = sentence-bert-swedish-cased
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,50325]   = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,50325]   = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 1
llama_model_loader: - kv  22:          tokenizer.ggml.seperator_token_id u32              = 3
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  24:                tokenizer.ggml.cls_token_id u32              = 2
llama_model_loader: - kv  25:               tokenizer.ggml.mask_token_id u32              = 4
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  197 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.3416 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 50325
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 3072
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 109M
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 124.10 M
llm_load_print_meta: model size       = 473.41 MiB (32.00 BPW) 
llm_load_print_meta: general.name     = Sentence Bert Swedish Cased
llm_load_print_meta: UNK token        = 1 '[UNK]'
llm_load_print_meta: SEP token        = 3 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 2 '[CLS]'
llm_load_print_meta: MASK token       = 4 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
llm_load_print_meta: max token length = 20
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.16 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors:        CPU buffer size =   148.94 MiB
llm_load_tensors:      CUDA0 buffer size =   324.46 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    72.00 MiB
llama_new_context_with_model: KV self size  =   72.00 MiB, K (f16):   36.00 MiB, V (f16):   36.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    20.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     5.00 MiB
llama_new_context_with_model: graph nodes  = 431
llama_new_context_with_model: graph splits = 2
2024-10-31T08:29:44.218Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
2024-10-31T08:29:44.218Z level=DEBUG source=server.go:643 msg="model load progress 1.00"
DEBUG [initialize] initializing slots | n_slots=1 tid="123204351016960" timestamp=1730363384
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="123204351016960" timestamp=1730363384
INFO [main] model loaded | tid="123204351016960" timestamp=1730363384
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="123204351016960" timestamp=1730363384
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=0 tid="123204351016960" timestamp=1730363384
2024-10-31T08:29:44.470Z level=INFO source=server.go:637 msg="llama runner started in 0.50 seconds"
2024-10-31T08:29:44.470Z level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-9ee1505b22d4bc8d192095f924ddb62bc4783a48fbd411252310933e879930f8
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=1 tid="123204351016960" timestamp=1730363384
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=2 tid="123204351016960" timestamp=1730363384
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=2 tid="123204351016960" timestamp=1730363384
2024-10-31T08:29:44.655Z level=INFO source=routes.go:478 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:38379/embedding\": EOF"
[GIN] 2024/10/31 - 08:29:44 | 500 |  796.579724ms |    172.21.0.150 | POST     "/api/embeddings"
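
As a possible follow-up diagnostic (a sketch, not something tried in this thread): the startup log above prints "Override detection logic by setting OLLAMA_LLM_LIBRARY", so the same embedding request could be replayed against one of the listed CPU runners to see whether the abort is tied to the CUDA backends:

```sh
# Diagnostic sketch (assumption): force a CPU runner from the list in the log
# (cpu, cpu_avx, cpu_avx2, cuda_v11, cuda_v12) and retry the embedding request.
OLLAMA_DEBUG=1 OLLAMA_LLM_LIBRARY=cpu_avx2 ollama serve
```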

@PierreMesure commented on GitHub (Oct 31, 2024):

And a final one using 0.4.0-rc5 and jeffh/intfloat-multilingual-e5-large-instruct:f32. The stacktrace is different and I removed the text that was embedded (content="XXX"). It also failed at the first embedding.

2024/10/31 08:35:27 routes.go:1170: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:8 OLLAMA_ORIGINS:[chrome-extension://* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
2024-10-31T08:35:27.838Z level=INFO source=images.go:754 msg="total blobs: 108"
2024-10-31T08:35:27.841Z level=INFO source=images.go:761 msg="total unused blobs removed: 0"
2024-10-31T08:35:27.843Z level=INFO source=routes.go:1217 msg="Listening on [::]:11434 (version 0.4.0-rc5)"
2024-10-31T08:35:27.843Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
2024-10-31T08:35:27.843Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
2024-10-31T08:35:27.843Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
2024-10-31T08:35:27.843Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
2024-10-31T08:35:27.843Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
2024-10-31T08:35:27.843Z level=INFO source=common.go:82 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12]"
2024-10-31T08:35:27.843Z level=DEBUG source=common.go:83 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
2024-10-31T08:35:27.843Z level=DEBUG source=sched.go:106 msg="starting llm scheduler"
2024-10-31T08:35:27.843Z level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
2024-10-31T08:35:27.845Z level=DEBUG source=gpu.go:94 msg="searching for GPU discovery libraries for NVIDIA"
2024-10-31T08:35:27.845Z level=DEBUG source=gpu.go:505 msg="Searching for GPU library" name=libcuda.so*
2024-10-31T08:35:27.845Z level=DEBUG source=gpu.go:528 msg="gpu library search" globs="[/usr/lib/ollama/libcuda.so* /usr/local/nvidia/lib/libcuda.so* /usr/local/nvidia/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
2024-10-31T08:35:27.873Z level=DEBUG source=gpu.go:562 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.535.183.01]
CUDA driver version: 12.2
2024-10-31T08:35:27.879Z level=DEBUG source=gpu.go:129 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.535.183.01
[GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4] CUDA totalMem 48669 mb
[GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4] CUDA freeMem 48117 mb
[GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4] Compute Capability 8.6
2024-10-31T08:35:27.975Z level=DEBUG source=amd_linux.go:416 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
2024-10-31T08:35:27.975Z level=INFO source=types.go:123 msg="inference compute" id=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4 library=cuda variant=v12 compute=8.6 driver=12.2 name="NVIDIA RTX A6000" total="47.5 GiB" available="47.0 GiB"
2024-10-31T08:35:31.046Z level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="125.5 GiB" before.free="90.4 GiB" before.free_swap="24.6 GiB" now.total="125.5 GiB" now.free="90.4 GiB" now.free_swap="24.6 GiB"
CUDA driver version: 12.2
2024-10-31T08:35:31.129Z level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4 name="NVIDIA RTX A6000" overhead="0 B" before.total="47.5 GiB" before.free="47.0 GiB" now.total="47.5 GiB" now.free="47.0 GiB" now.used="551.9 MiB"
releasing cuda driver library
2024-10-31T08:35:31.226Z level=DEBUG source=sched.go:225 msg="loading first model" model=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4
2024-10-31T08:35:31.226Z level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[47.0 GiB]"
2024-10-31T08:35:31.226Z level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 gpu=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4 parallel=1 available=50454462464 required="2.6 GiB"
2024-10-31T08:35:31.226Z level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="125.5 GiB" before.free="90.4 GiB" before.free_swap="24.6 GiB" now.total="125.5 GiB" now.free="90.4 GiB" now.free_swap="24.6 GiB"
CUDA driver version: 12.2
2024-10-31T08:35:31.302Z level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4 name="NVIDIA RTX A6000" overhead="0 B" before.total="47.5 GiB" before.free="47.0 GiB" now.total="47.5 GiB" now.free="47.0 GiB" now.used="551.9 MiB"
releasing cuda driver library
2024-10-31T08:35:31.302Z level=INFO source=llama-server.go:72 msg="system memory" total="125.5 GiB" free="90.4 GiB" free_swap="24.6 GiB"
2024-10-31T08:35:31.302Z level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[47.0 GiB]"
2024-10-31T08:35:31.303Z level=INFO source=memory.go:346 msg="offload to cuda" layers.requested=-1 layers.model=25 layers.offload=25 layers.split="" memory.available="[47.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.6 GiB" memory.required.partial="2.6 GiB" memory.required.kv="12.0 MiB" memory.required.allocations="[2.6 GiB]" memory.weights.total="1.1 GiB" memory.weights.repeating="188.6 MiB" memory.weights.nonrepeating="976.6 MiB" memory.graph.full="32.0 MiB" memory.graph.partial="32.0 MiB"
2024-10-31T08:35:31.303Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
2024-10-31T08:35:31.303Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
2024-10-31T08:35:31.303Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
2024-10-31T08:35:31.303Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
2024-10-31T08:35:31.303Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
2024-10-31T08:35:31.303Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
2024-10-31T08:35:31.303Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
2024-10-31T08:35:31.303Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
2024-10-31T08:35:31.303Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
2024-10-31T08:35:31.303Z level=DEBUG source=common.go:327 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
2024-10-31T08:35:31.303Z level=INFO source=llama-server.go:355 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 --ctx-size 2048 --batch-size 512 --embedding --n-gpu-layers 25 --verbose --threads 12 --parallel 1 --port 43231"
2024-10-31T08:35:31.304Z level=DEBUG source=llama-server.go:372 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/runners/cuda_v12:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-1059788e-3724-4aee-38c3-5c6eb2340dc4]"
2024-10-31T08:35:31.304Z level=INFO source=sched.go:450 msg="loaded runners" count=1
2024-10-31T08:35:31.304Z level=INFO source=llama-server.go:534 msg="waiting for llama runner to start responding"
2024-10-31T08:35:31.304Z level=INFO source=llama-server.go:568 msg="waiting for server to become available" status="llm server error"
2024-10-31T08:35:31.341Z level=INFO source=runner.go:869 msg="starting go runner"
2024-10-31T08:35:31.341Z level=DEBUG source=runner.go:870 msg="system info" cpu="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " threads=12
2024-10-31T08:35:31.341Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:43231"
llama_model_loader: loaded meta data with 38 key-value pairs and 389 tensors from /root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = multilingual-e5-large-instruct
llama_model_loader: - kv   3:                       general.organization str              = Tmp
llama_model_loader: - kv   4:                           general.finetune str              = instruct
llama_model_loader: - kv   5:                           general.basename str              = intfloat-multilingual-e5
llama_model_loader: - kv   6:                         general.size_label str              = large
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                               general.tags arr[str,3]       = ["mteb", "sentence-transformers", "tr...
llama_model_loader: - kv   9:                          general.languages arr[str,94]      = ["multilingual", "af", "am", "ar", "a...
llama_model_loader: - kv  10:                           bert.block_count u32              = 24
llama_model_loader: - kv  11:                        bert.context_length u32              = 512
llama_model_loader: - kv  12:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv  13:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv  14:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv  15:          bert.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                          general.file_type u32              = 0
llama_model_loader: - kv  17:                      bert.attention.causal bool             = false
llama_model_loader: - kv  18:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = t5
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,250002]  = ["<s>", "<pad>", "</s>", "<unk>", ","...
2024-10-31T08:35:31.556Z level=INFO source=llama-server.go:568 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,250002]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,250002]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  25:            tokenizer.ggml.token_type_count u32              = 1
llama_model_loader: - kv  26:    tokenizer.ggml.remove_extra_whitespaces bool             = true
llama_model_loader: - kv  27:        tokenizer.ggml.precompiled_charsmap arr[u8,237539]   = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  31:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  33:                tokenizer.ggml.cls_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.mask_token_id u32              = 250001
llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  389 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 4
llm_load_vocab: token to piece cache size = 2.1668 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = UGM
llm_load_print_meta: n_vocab          = 250002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 1024
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4096
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 335M
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 558.84 M
llm_load_print_meta: model size       = 2.08 GiB (32.00 BPW) 
llm_load_print_meta: general.name     = multilingual-e5-large-instruct
llm_load_print_meta: BOS token        = 0 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: SEP token        = 2 '</s>'
llm_load_print_meta: PAD token        = 1 '<pad>'
llm_load_print_meta: CLS token        = 0 '<s>'
llm_load_print_meta: MASK token       = 250001 '[PAD250000]'
llm_load_print_meta: LF token         = 6 '▁'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.32 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:        CPU buffer size =   978.57 MiB
llm_load_tensors:      CUDA0 buffer size =  1153.23 MiB
2024-10-31T08:35:32.310Z level=DEBUG source=llama-server.go:579 msg="model load progress 0.65"
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    26.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     6.00 MiB
llama_new_context_with_model: graph nodes  = 851
llama_new_context_with_model: graph splits = 2
2024-10-31T08:35:32.561Z level=INFO source=llama-server.go:573 msg="llama runner started in 1.26 seconds"
2024-10-31T08:35:32.561Z level=DEBUG source=sched.go:463 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4
2024-10-31T08:35:32.562Z level=DEBUG source=runner.go:713 msg="embedding request" content="XXX"
2024-10-31T08:35:32.565Z level=DEBUG source=cache.go:105 msg="loading cache slot" id=0 cache=0 prompt=688 used=0 remaining=688
ggml.c:13425: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
SIGSEGV: segmentation violation
PC=0x73b9424600d7 m=3 sigcode=1 addr=0x204a03fe0
signal arrived during cgo execution

goroutine 7 gp=0xc0000fc000 m=3 mp=0xc000073008 [syscall]:
runtime.cgocall(0x5afe89c8f7f0, 0xc000080ad8)
      runtime/cgocall.go:157 +0x4b fp=0xc000080ab0 sp=0xc000080a78 pc=0x5afe89a132cb
github.com/ollama/ollama/llama._Cfunc_llama_decode(0x73b8c80064a0, {0xb0, 0x73b8c8132a40, 0x0, 0x73b8c8133250, 0x73b8c8133a60, 0x73b8c8134270, 0x73b8a4eae910, 0x0, 0x0, ...})
      _cgo_gotypes.go:512 +0x4f fp=0xc000080ad8 sp=0xc000080ab0 pc=0x5afe89b10b0f
github.com/ollama/ollama/llama.(*Context).Decode.func1(0x5afe89a23178?, 0x3ff?)
      github.com/ollama/ollama/llama/llama.go:124 +0x11e fp=0xc000080be0 sp=0xc000080ad8 pc=0x5afe89b129be
github.com/ollama/ollama/llama.(*Context).Decode(0xc000262808?, 0xc000080ef8?)
      github.com/ollama/ollama/llama/llama.go:124 +0x17 fp=0xc000080c28 sp=0xc000080be0 pc=0x5afe89b127d7
main.(*Server).processBatch(0xc0000cc120, 0xc000080ef8, 0xc000080e90)
      github.com/ollama/ollama/llama/runner/runner.go:434 +0x285 fp=0xc000080df8 sp=0xc000080c28 pc=0x5afe89c8a9a5
main.(*Server).run(0xc0000cc120, {0x5afe89fc7cc8, 0xc0000a2050})
      github.com/ollama/ollama/llama/runner/runner.go:352 +0x359 fp=0xc000080fb8 sp=0xc000080df8 pc=0x5afe89c8a279
main.main.gowrap2()
      github.com/ollama/ollama/llama/runner/runner.go:907 +0x28 fp=0xc000080fe0 sp=0xc000080fb8 pc=0x5afe89c8e988
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x5afe89a7bce1
created by main.main in goroutine 1
      github.com/ollama/ollama/llama/runner/runner.go:907 +0xcab

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x1?, 0xc00002b908?, 0xf4?, 0x9c?, 0xc00002b8e8?)
      runtime/proc.go:402 +0xce fp=0xc00002b888 sp=0xc00002b868 pc=0x5afe89a49f0e
runtime.netpollblock(0x10?, 0x89a12a26?, 0xfe?)
      runtime/netpoll.go:573 +0xf7 fp=0xc00002b8c0 sp=0xc00002b888 pc=0x5afe89a42157
internal/poll.runtime_pollWait(0x73b93aa68020, 0x72)
      runtime/netpoll.go:345 +0x85 fp=0xc00002b8e0 sp=0xc00002b8c0 pc=0x5afe89a769a5
internal/poll.(*pollDesc).wait(0x3?, 0x73b93aaa3e88?, 0x0)
      internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00002b908 sp=0xc00002b8e0 pc=0x5afe89ac6c87
internal/poll.(*pollDesc).waitRead(...)
      internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0000f8080)
      internal/poll/fd_unix.go:611 +0x2ac fp=0xc00002b9b0 sp=0xc00002b908 pc=0x5afe89ac814c
net.(*netFD).accept(0xc0000f8080)
      net/fd_unix.go:172 +0x29 fp=0xc00002ba68 sp=0xc00002b9b0 pc=0x5afe89b35789
net.(*TCPListener).accept(0xc00007c1c0)
      net/tcpsock_posix.go:159 +0x1e fp=0xc00002ba90 sp=0xc00002ba68 pc=0x5afe89b464be
net.(*TCPListener).Accept(0xc00007c1c0)
      net/tcpsock.go:327 +0x30 fp=0xc00002bac0 sp=0xc00002ba90 pc=0x5afe89b45810
net/http.(*onceCloseListener).Accept(0xc0001a2000?)
      <autogenerated>:1 +0x24 fp=0xc00002bad8 sp=0xc00002bac0 pc=0x5afe89c6c924
net/http.(*Server).Serve(0xc0000fe000, {0x5afe89fc76c0, 0xc00007c1c0})
      net/http/server.go:3260 +0x33e fp=0xc00002bc08 sp=0xc00002bad8 pc=0x5afe89c6373e
main.main()
      github.com/ollama/ollama/llama/runner/runner.go:927 +0x104c fp=0xc00002bf50 sp=0xc00002bc08 pc=0x5afe89c8e70c
runtime.main()
      runtime/proc.go:271 +0x29d fp=0xc00002bfe0 sp=0xc00002bf50 pc=0x5afe89a49add
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00002bfe8 sp=0xc00002bfe0 pc=0x5afe89a7bce1

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
      runtime/proc.go:402 +0xce fp=0xc00006cfa8 sp=0xc00006cf88 pc=0x5afe89a49f0e
runtime.goparkunlock(...)
      runtime/proc.go:408
runtime.forcegchelper()
      runtime/proc.go:326 +0xb8 fp=0xc00006cfe0 sp=0xc00006cfa8 pc=0x5afe89a49d98
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00006cfe8 sp=0xc00006cfe0 pc=0x5afe89a7bce1
created by runtime.init.6 in goroutine 1
      runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
      runtime/proc.go:402 +0xce fp=0xc00006d780 sp=0xc00006d760 pc=0x5afe89a49f0e
runtime.goparkunlock(...)
      runtime/proc.go:408
runtime.bgsweep(0xc000022070)
      runtime/mgcsweep.go:278 +0x94 fp=0xc00006d7c8 sp=0xc00006d780 pc=0x5afe89a34a54
runtime.gcenable.gowrap1()
      runtime/mgc.go:203 +0x25 fp=0xc00006d7e0 sp=0xc00006d7c8 pc=0x5afe89a29585
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00006d7e8 sp=0xc00006d7e0 pc=0x5afe89a7bce1
created by runtime.gcenable in goroutine 1
      runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0xc000022070?, 0x5afe89ec9be8?, 0x1?, 0x0?, 0xc000007340?)
      runtime/proc.go:402 +0xce fp=0xc00006df78 sp=0xc00006df58 pc=0x5afe89a49f0e
runtime.goparkunlock(...)
      runtime/proc.go:408
runtime.(*scavengerState).park(0x5afe8a1944c0)
      runtime/mgcscavenge.go:425 +0x49 fp=0xc00006dfa8 sp=0xc00006df78 pc=0x5afe89a32449
runtime.bgscavenge(0xc000022070)
      runtime/mgcscavenge.go:653 +0x3c fp=0xc00006dfc8 sp=0xc00006dfa8 pc=0x5afe89a329dc
runtime.gcenable.gowrap2()
      runtime/mgc.go:204 +0x25 fp=0xc00006dfe0 sp=0xc00006dfc8 pc=0x5afe89a29525
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00006dfe8 sp=0xc00006dfe0 pc=0x5afe89a7bce1
created by runtime.gcenable in goroutine 1
      runtime/mgc.go:204 +0xa5

goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]:
runtime.gopark(0xc00006c648?, 0x5afe89a1ce85?, 0xa8?, 0x1?, 0xc0000061c0?)
      runtime/proc.go:402 +0xce fp=0xc00006c620 sp=0xc00006c600 pc=0x5afe89a49f0e
runtime.runfinq()
      runtime/mfinal.go:194 +0x107 fp=0xc00006c7e0 sp=0xc00006c620 pc=0x5afe89a285c7
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00006c7e8 sp=0xc00006c7e0 pc=0x5afe89a7bce1
created by runtime.createfing in goroutine 1
      runtime/mfinal.go:164 +0x3d

goroutine 18 gp=0xc0001a8000 m=nil [chan receive]:
runtime.gopark(0x5afe89a79cf4?, 0xc0000ef8c0?, 0x85?, 0xc0?, 0xc0000ef8a8?)
      runtime/proc.go:402 +0xce fp=0xc0000ef888 sp=0xc0000ef868 pc=0x5afe89a49f0e
runtime.chanrecv(0xc000098540, 0xc0000efa68, 0x1)
      runtime/chan.go:583 +0x3bf fp=0xc0000ef900 sp=0xc0000ef888 pc=0x5afe89a158df
runtime.chanrecv1(0xc0001b20e0?, 0xc000270008?)
      runtime/chan.go:442 +0x12 fp=0xc0000ef928 sp=0xc0000ef900 pc=0x5afe89a15512
main.(*Server).embeddings(0xc0000cc120, {0x5afe89fc7870, 0xc0000f68c0}, 0xc000016ea0)
      github.com/ollama/ollama/llama/runner/runner.go:738 +0x5cb fp=0xc0000efab8 sp=0xc0000ef928 pc=0x5afe89c8cbeb
main.(*Server).embeddings-fm({0x5afe89fc7870?, 0xc0000f68c0?}, 0x5afe89c67a6d?)
      <autogenerated>:1 +0x36 fp=0xc0000efae8 sp=0xc0000efab8 pc=0x5afe89c8eff6
net/http.HandlerFunc.ServeHTTP(0xc000018ea0?, {0x5afe89fc7870?, 0xc0000f68c0?}, 0x10?)
      net/http/server.go:2171 +0x29 fp=0xc0000efb10 sp=0xc0000efae8 pc=0x5afe89c60509
net/http.(*ServeMux).ServeHTTP(0x5afe89a1ce85?, {0x5afe89fc7870, 0xc0000f68c0}, 0xc000016ea0)
      net/http/server.go:2688 +0x1ad fp=0xc0000efb60 sp=0xc0000efb10 pc=0x5afe89c6238d
net/http.serverHandler.ServeHTTP({0x5afe89fc6bc0?}, {0x5afe89fc7870?, 0xc0000f68c0?}, 0x6?)
      net/http/server.go:3142 +0x8e fp=0xc0000efb90 sp=0xc0000efb60 pc=0x5afe89c633ae
net/http.(*conn).serve(0xc0001a2000, {0x5afe89fc7c90, 0xc0000b0db0})
      net/http/server.go:2044 +0x5e8 fp=0xc0000effb8 sp=0xc0000efb90 pc=0x5afe89c5f148
net/http.(*Server).Serve.gowrap3()
      net/http/server.go:3290 +0x28 fp=0xc0000effe0 sp=0xc0000effb8 pc=0x5afe89c63b28
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc0000effe8 sp=0xc0000effe0 pc=0x5afe89a7bce1
created by net/http.(*Server).Serve in goroutine 1
      net/http/server.go:3290 +0x4b4

goroutine 12 gp=0xc0000fc700 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
      runtime/proc.go:402 +0xce fp=0xc00021a5a8 sp=0xc00021a588 pc=0x5afe89a49f0e
runtime.netpollblock(0x5afe89ab0818?, 0x89a12a26?, 0xfe?)
      runtime/netpoll.go:573 +0xf7 fp=0xc00021a5e0 sp=0xc00021a5a8 pc=0x5afe89a42157
internal/poll.runtime_pollWait(0x73b93aa67f28, 0x72)
      runtime/netpoll.go:345 +0x85 fp=0xc00021a600 sp=0xc00021a5e0 pc=0x5afe89a769a5
internal/poll.(*pollDesc).wait(0xc0001a0000?, 0xc0000b0e21?, 0x0)
      internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00021a628 sp=0xc00021a600 pc=0x5afe89ac6c87
internal/poll.(*pollDesc).waitRead(...)
      internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0001a0000, {0xc0000b0e21, 0x1, 0x1})
      internal/poll/fd_unix.go:164 +0x27a fp=0xc00021a6c0 sp=0xc00021a628 pc=0x5afe89ac77da
net.(*netFD).Read(0xc0001a0000, {0xc0000b0e21?, 0x0?, 0x0?})
      net/fd_posix.go:55 +0x25 fp=0xc00021a708 sp=0xc00021a6c0 pc=0x5afe89b34685
net.(*conn).Read(0xc000198008, {0xc0000b0e21?, 0x0?, 0x0?})
      net/net.go:185 +0x45 fp=0xc00021a750 sp=0xc00021a708 pc=0x5afe89b3e945
net.(*TCPConn).Read(0x0?, {0xc0000b0e21?, 0x0?, 0x0?})
      <autogenerated>:1 +0x25 fp=0xc00021a780 sp=0xc00021a750 pc=0x5afe89b4a325
net/http.(*connReader).backgroundRead(0xc0000b0e10)
      net/http/server.go:681 +0x37 fp=0xc00021a7c8 sp=0xc00021a780 pc=0x5afe89c590b7
net/http.(*connReader).startBackgroundRead.gowrap2()
      net/http/server.go:677 +0x25 fp=0xc00021a7e0 sp=0xc00021a7c8 pc=0x5afe89c58fe5
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00021a7e8 sp=0xc00021a7e0 pc=0x5afe89a7bce1
created by net/http.(*connReader).startBackgroundRead in goroutine 18
      net/http/server.go:677 +0xba

rax    0x204a03fe0
rbx    0x73b8a4bc0a00
rcx    0xff8
rdx    0x73b8a4884240
rdi    0x73b8a4884250
rsi    0x0
rbp    0x73b8cd5ddc60
rsp    0x73b8cd5ddc40
r8     0x4
r9     0x0
r10    0x4
r11    0x8
r12    0x73b8a54cc290
r13    0x73b8a4884250
r14    0x0
r15    0x73b98d71ed00
rip    0x73b9424600d7
rflags 0x10297
cs     0x33
fs     0x0
gs     0x0
SIGABRT: abort
PC=0x73b94186d9fc m=3 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 7 gp=0xc0000fc000 m=3 mp=0xc000073008 [syscall]:
runtime.cgocall(0x5afe89c8f7f0, 0xc000080ad8)
      runtime/cgocall.go:157 +0x4b fp=0xc000080ab0 sp=0xc000080a78 pc=0x5afe89a132cb
github.com/ollama/ollama/llama._Cfunc_llama_decode(0x73b8c80064a0, {0xb0, 0x73b8c8132a40, 0x0, 0x73b8c8133250, 0x73b8c8133a60, 0x73b8c8134270, 0x73b8a4eae910, 0x0, 0x0, ...})
      _cgo_gotypes.go:512 +0x4f fp=0xc000080ad8 sp=0xc000080ab0 pc=0x5afe89b10b0f
github.com/ollama/ollama/llama.(*Context).Decode.func1(0x5afe89a23178?, 0x3ff?)
      github.com/ollama/ollama/llama/llama.go:124 +0x11e fp=0xc000080be0 sp=0xc000080ad8 pc=0x5afe89b129be
github.com/ollama/ollama/llama.(*Context).Decode(0xc000262808?, 0xc000080ef8?)
      github.com/ollama/ollama/llama/llama.go:124 +0x17 fp=0xc000080c28 sp=0xc000080be0 pc=0x5afe89b127d7
main.(*Server).processBatch(0xc0000cc120, 0xc000080ef8, 0xc000080e90)
      github.com/ollama/ollama/llama/runner/runner.go:434 +0x285 fp=0xc000080df8 sp=0xc000080c28 pc=0x5afe89c8a9a5
main.(*Server).run(0xc0000cc120, {0x5afe89fc7cc8, 0xc0000a2050})
      github.com/ollama/ollama/llama/runner/runner.go:352 +0x359 fp=0xc000080fb8 sp=0xc000080df8 pc=0x5afe89c8a279
main.main.gowrap2()
      github.com/ollama/ollama/llama/runner/runner.go:907 +0x28 fp=0xc000080fe0 sp=0xc000080fb8 pc=0x5afe89c8e988
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x5afe89a7bce1
created by main.main in goroutine 1
      github.com/ollama/ollama/llama/runner/runner.go:907 +0xcab

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x1?, 0xc00002b908?, 0xf4?, 0x9c?, 0xc00002b8e8?)
      runtime/proc.go:402 +0xce fp=0xc00002b888 sp=0xc00002b868 pc=0x5afe89a49f0e
runtime.netpollblock(0x10?, 0x89a12a26?, 0xfe?)
      runtime/netpoll.go:573 +0xf7 fp=0xc00002b8c0 sp=0xc00002b888 pc=0x5afe89a42157
internal/poll.runtime_pollWait(0x73b93aa68020, 0x72)
      runtime/netpoll.go:345 +0x85 fp=0xc00002b8e0 sp=0xc00002b8c0 pc=0x5afe89a769a5
internal/poll.(*pollDesc).wait(0x3?, 0x73b93aaa3e88?, 0x0)
      internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00002b908 sp=0xc00002b8e0 pc=0x5afe89ac6c87
internal/poll.(*pollDesc).waitRead(...)
      internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0000f8080)
      internal/poll/fd_unix.go:611 +0x2ac fp=0xc00002b9b0 sp=0xc00002b908 pc=0x5afe89ac814c
net.(*netFD).accept(0xc0000f8080)
      net/fd_unix.go:172 +0x29 fp=0xc00002ba68 sp=0xc00002b9b0 pc=0x5afe89b35789
net.(*TCPListener).accept(0xc00007c1c0)
      net/tcpsock_posix.go:159 +0x1e fp=0xc00002ba90 sp=0xc00002ba68 pc=0x5afe89b464be
net.(*TCPListener).Accept(0xc00007c1c0)
      net/tcpsock.go:327 +0x30 fp=0xc00002bac0 sp=0xc00002ba90 pc=0x5afe89b45810
net/http.(*onceCloseListener).Accept(0xc0001a2000?)
      <autogenerated>:1 +0x24 fp=0xc00002bad8 sp=0xc00002bac0 pc=0x5afe89c6c924
net/http.(*Server).Serve(0xc0000fe000, {0x5afe89fc76c0, 0xc00007c1c0})
      net/http/server.go:3260 +0x33e fp=0xc00002bc08 sp=0xc00002bad8 pc=0x5afe89c6373e
main.main()
      github.com/ollama/ollama/llama/runner/runner.go:927 +0x104c fp=0xc00002bf50 sp=0xc00002bc08 pc=0x5afe89c8e70c
runtime.main()
      runtime/proc.go:271 +0x29d fp=0xc00002bfe0 sp=0xc00002bf50 pc=0x5afe89a49add
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00002bfe8 sp=0xc00002bfe0 pc=0x5afe89a7bce1

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
      runtime/proc.go:402 +0xce fp=0xc00006cfa8 sp=0xc00006cf88 pc=0x5afe89a49f0e
runtime.goparkunlock(...)
      runtime/proc.go:408
runtime.forcegchelper()
      runtime/proc.go:326 +0xb8 fp=0xc00006cfe0 sp=0xc00006cfa8 pc=0x5afe89a49d98
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00006cfe8 sp=0xc00006cfe0 pc=0x5afe89a7bce1
created by runtime.init.6 in goroutine 1
      runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
      runtime/proc.go:402 +0xce fp=0xc00006d780 sp=0xc00006d760 pc=0x5afe89a49f0e
runtime.goparkunlock(...)
      runtime/proc.go:408
runtime.bgsweep(0xc000022070)
      runtime/mgcsweep.go:278 +0x94 fp=0xc00006d7c8 sp=0xc00006d780 pc=0x5afe89a34a54
runtime.gcenable.gowrap1()
      runtime/mgc.go:203 +0x25 fp=0xc00006d7e0 sp=0xc00006d7c8 pc=0x5afe89a29585
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00006d7e8 sp=0xc00006d7e0 pc=0x5afe89a7bce1
created by runtime.gcenable in goroutine 1
      runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0xc000022070?, 0x5afe89ec9be8?, 0x1?, 0x0?, 0xc000007340?)
      runtime/proc.go:402 +0xce fp=0xc00006df78 sp=0xc00006df58 pc=0x5afe89a49f0e
runtime.goparkunlock(...)
      runtime/proc.go:408
runtime.(*scavengerState).park(0x5afe8a1944c0)
      runtime/mgcscavenge.go:425 +0x49 fp=0xc00006dfa8 sp=0xc00006df78 pc=0x5afe89a32449
runtime.bgscavenge(0xc000022070)
      runtime/mgcscavenge.go:653 +0x3c fp=0xc00006dfc8 sp=0xc00006dfa8 pc=0x5afe89a329dc
runtime.gcenable.gowrap2()
      runtime/mgc.go:204 +0x25 fp=0xc00006dfe0 sp=0xc00006dfc8 pc=0x5afe89a29525
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00006dfe8 sp=0xc00006dfe0 pc=0x5afe89a7bce1
created by runtime.gcenable in goroutine 1
      runtime/mgc.go:204 +0xa5

goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]:
runtime.gopark(0xc00006c648?, 0x5afe89a1ce85?, 0xa8?, 0x1?, 0xc0000061c0?)
      runtime/proc.go:402 +0xce fp=0xc00006c620 sp=0xc00006c600 pc=0x5afe89a49f0e
runtime.runfinq()
      runtime/mfinal.go:194 +0x107 fp=0xc00006c7e0 sp=0xc00006c620 pc=0x5afe89a285c7
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00006c7e8 sp=0xc00006c7e0 pc=0x5afe89a7bce1
created by runtime.createfing in goroutine 1
      runtime/mfinal.go:164 +0x3d

goroutine 18 gp=0xc0001a8000 m=nil [chan receive]:
runtime.gopark(0x5afe89a79cf4?, 0xc0000ef8c0?, 0x85?, 0xc0?, 0xc0000ef8a8?)
      runtime/proc.go:402 +0xce fp=0xc0000ef888 sp=0xc0000ef868 pc=0x5afe89a49f0e
runtime.chanrecv(0xc000098540, 0xc0000efa68, 0x1)
      runtime/chan.go:583 +0x3bf fp=0xc0000ef900 sp=0xc0000ef888 pc=0x5afe89a158df
runtime.chanrecv1(0xc0001b20e0?, 0xc000270008?)
      runtime/chan.go:442 +0x12 fp=0xc0000ef928 sp=0xc0000ef900 pc=0x5afe89a15512
main.(*Server).embeddings(0xc0000cc120, {0x5afe89fc7870, 0xc0000f68c0}, 0xc000016ea0)
      github.com/ollama/ollama/llama/runner/runner.go:738 +0x5cb fp=0xc0000efab8 sp=0xc0000ef928 pc=0x5afe89c8cbeb
main.(*Server).embeddings-fm({0x5afe89fc7870?, 0xc0000f68c0?}, 0x5afe89c67a6d?)
      <autogenerated>:1 +0x36 fp=0xc0000efae8 sp=0xc0000efab8 pc=0x5afe89c8eff6
net/http.HandlerFunc.ServeHTTP(0xc000018ea0?, {0x5afe89fc7870?, 0xc0000f68c0?}, 0x10?)
      net/http/server.go:2171 +0x29 fp=0xc0000efb10 sp=0xc0000efae8 pc=0x5afe89c60509
net/http.(*ServeMux).ServeHTTP(0x5afe89a1ce85?, {0x5afe89fc7870, 0xc0000f68c0}, 0xc000016ea0)
      net/http/server.go:2688 +0x1ad fp=0xc0000efb60 sp=0xc0000efb10 pc=0x5afe89c6238d
net/http.serverHandler.ServeHTTP({0x5afe89fc6bc0?}, {0x5afe89fc7870?, 0xc0000f68c0?}, 0x6?)
      net/http/server.go:3142 +0x8e fp=0xc0000efb90 sp=0xc0000efb60 pc=0x5afe89c633ae
net/http.(*conn).serve(0xc0001a2000, {0x5afe89fc7c90, 0xc0000b0db0})
      net/http/server.go:2044 +0x5e8 fp=0xc0000effb8 sp=0xc0000efb90 pc=0x5afe89c5f148
net/http.(*Server).Serve.gowrap3()
      net/http/server.go:3290 +0x28 fp=0xc0000effe0 sp=0xc0000effb8 pc=0x5afe89c63b28
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc0000effe8 sp=0xc0000effe0 pc=0x5afe89a7bce1
created by net/http.(*Server).Serve in goroutine 1
      net/http/server.go:3290 +0x4b4

goroutine 12 gp=0xc0000fc700 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
      runtime/proc.go:402 +0xce fp=0xc00021a5a8 sp=0xc00021a588 pc=0x5afe89a49f0e
runtime.netpollblock(0x5afe89ab0818?, 0x89a12a26?, 0xfe?)
      runtime/netpoll.go:573 +0xf7 fp=0xc00021a5e0 sp=0xc00021a5a8 pc=0x5afe89a42157
internal/poll.runtime_pollWait(0x73b93aa67f28, 0x72)
      runtime/netpoll.go:345 +0x85 fp=0xc00021a600 sp=0xc00021a5e0 pc=0x5afe89a769a5
internal/poll.(*pollDesc).wait(0xc0001a0000?, 0xc0000b0e21?, 0x0)
      internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00021a628 sp=0xc00021a600 pc=0x5afe89ac6c87
internal/poll.(*pollDesc).waitRead(...)
      internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0001a0000, {0xc0000b0e21, 0x1, 0x1})
      internal/poll/fd_unix.go:164 +0x27a fp=0xc00021a6c0 sp=0xc00021a628 pc=0x5afe89ac77da
net.(*netFD).Read(0xc0001a0000, {0xc0000b0e21?, 0x0?, 0x0?})
      net/fd_posix.go:55 +0x25 fp=0xc00021a708 sp=0xc00021a6c0 pc=0x5afe89b34685
net.(*conn).Read(0xc000198008, {0xc0000b0e21?, 0x0?, 0x0?})
      net/net.go:185 +0x45 fp=0xc00021a750 sp=0xc00021a708 pc=0x5afe89b3e945
net.(*TCPConn).Read(0x0?, {0xc0000b0e21?, 0x0?, 0x0?})
      <autogenerated>:1 +0x25 fp=0xc00021a780 sp=0xc00021a750 pc=0x5afe89b4a325
net/http.(*connReader).backgroundRead(0xc0000b0e10)
      net/http/server.go:681 +0x37 fp=0xc00021a7c8 sp=0xc00021a780 pc=0x5afe89c590b7
net/http.(*connReader).startBackgroundRead.gowrap2()
      net/http/server.go:677 +0x25 fp=0xc00021a7e0 sp=0xc00021a7c8 pc=0x5afe89c58fe5
runtime.goexit({})
      runtime/asm_amd64.s:1695 +0x1 fp=0xc00021a7e8 sp=0xc00021a7e0 pc=0x5afe89a7bce1
created by net/http.(*connReader).startBackgroundRead in goroutine 18
      net/http/server.go:677 +0xba

rax    0x0
rbx    0x73b8cd600000
rcx    0x73b94186d9fc
rdx    0x6
rdi    0x1e
rsi    0x20
rbp    0x20
rsp    0x73b8cd5ddcb0
r8     0x73b8cd5ddd80
r9     0x73b8cd5ddd50
r10    0x8
r11    0x246
r12    0x6
r13    0x16
r14    0x0
r15    0x73b8cd5de090
rip    0x73b94186d9fc
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
2024-10-31T08:35:32.735Z level=INFO source=routes.go:490 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:43231/embedding\": EOF"
[GIN] 2024/10/31 - 08:35:32 | 500 |  1.690382474s |    172.21.0.150 | POST     "/api/embeddings"
2024-10-31T08:35:32.735Z level=DEBUG source=sched.go:467 msg="context for request finished"
2024-10-31T08:35:32.735Z level=DEBUG source=sched.go:340 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 duration=5m0s
2024-10-31T08:35:32.735Z level=DEBUG source=sched.go:358 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 refCount=0
2024-10-31T08:35:32.751Z level=DEBUG source=llama-server.go:395 msg="llama runner terminated" error="exit status 2"
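
One detail that stands out in the log above: the cache slot reports `prompt=688` tokens while the model metadata reports `n_ctx_train = 512`. Whether exceeding the trained context is actually what trips `GGML_ASSERT(i01 >= 0 && i01 < ne01)` is only a hypothesis, not something confirmed in this thread, but under that assumption a hypothetical client-side guard would be to cap the input before sending it to `/api/embeddings`. The sketch below is illustrative only; `maxRunes` is a crude stand-in for a real token count.

```go
// Hypothetical client-side guard (see the caveats above): crudely cap input
// length before embedding. A real implementation would count tokens with the
// model's tokenizer; a token usually spans several characters, so any
// character-based limit is only a rough proxy.
package main

import "fmt"

func truncateForEmbedding(text string, maxRunes int) string {
	runes := []rune(text)
	if len(runes) <= maxRunes {
		return text
	}
	return string(runes[:maxRunes])
}

func main() {
	doc := "..." // the redacted content="XXX" from the report
	fmt.Println(truncateForEmbedding(doc, 1000))
}
```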
internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc0001a0000, {0xc0000b0e21, 0x1, 0x1}) internal/poll/fd_unix.go:164 +0x27a fp=0xc00021a6c0 sp=0xc00021a628 pc=0x5afe89ac77da net.(*netFD).Read(0xc0001a0000, {0xc0000b0e21?, 0x0?, 0x0?}) net/fd_posix.go:55 +0x25 fp=0xc00021a708 sp=0xc00021a6c0 pc=0x5afe89b34685 net.(*conn).Read(0xc000198008, {0xc0000b0e21?, 0x0?, 0x0?}) net/net.go:185 +0x45 fp=0xc00021a750 sp=0xc00021a708 pc=0x5afe89b3e945 net.(*TCPConn).Read(0x0?, {0xc0000b0e21?, 0x0?, 0x0?}) <autogenerated>:1 +0x25 fp=0xc00021a780 sp=0xc00021a750 pc=0x5afe89b4a325 net/http.(*connReader).backgroundRead(0xc0000b0e10) net/http/server.go:681 +0x37 fp=0xc00021a7c8 sp=0xc00021a780 pc=0x5afe89c590b7 net/http.(*connReader).startBackgroundRead.gowrap2() net/http/server.go:677 +0x25 fp=0xc00021a7e0 sp=0xc00021a7c8 pc=0x5afe89c58fe5 runtime.goexit({}) runtime/asm_amd64.s:1695 +0x1 fp=0xc00021a7e8 sp=0xc00021a7e0 pc=0x5afe89a7bce1 created by net/http.(*connReader).startBackgroundRead in goroutine 18 net/http/server.go:677 +0xba rax 0x204a03fe0 rbx 0x73b8a4bc0a00 rcx 0xff8 rdx 0x73b8a4884240 rdi 0x73b8a4884250 rsi 0x0 rbp 0x73b8cd5ddc60 rsp 0x73b8cd5ddc40 r8 0x4 r9 0x0 r10 0x4 r11 0x8 r12 0x73b8a54cc290 r13 0x73b8a4884250 r14 0x0 r15 0x73b98d71ed00 rip 0x73b9424600d7 rflags 0x10297 cs 0x33 fs 0x0 gs 0x0 SIGABRT: abort PC=0x73b94186d9fc m=3 sigcode=18446744073709551610 signal arrived during cgo execution goroutine 7 gp=0xc0000fc000 m=3 mp=0xc000073008 [syscall]: runtime.cgocall(0x5afe89c8f7f0, 0xc000080ad8) runtime/cgocall.go:157 +0x4b fp=0xc000080ab0 sp=0xc000080a78 pc=0x5afe89a132cb github.com/ollama/ollama/llama._Cfunc_llama_decode(0x73b8c80064a0, {0xb0, 0x73b8c8132a40, 0x0, 0x73b8c8133250, 0x73b8c8133a60, 0x73b8c8134270, 0x73b8a4eae910, 0x0, 0x0, ...}) _cgo_gotypes.go:512 +0x4f fp=0xc000080ad8 sp=0xc000080ab0 pc=0x5afe89b10b0f github.com/ollama/ollama/llama.(*Context).Decode.func1(0x5afe89a23178?, 0x3ff?) github.com/ollama/ollama/llama/llama.go:124 +0x11e fp=0xc000080be0 sp=0xc000080ad8 pc=0x5afe89b129be github.com/ollama/ollama/llama.(*Context).Decode(0xc000262808?, 0xc000080ef8?) github.com/ollama/ollama/llama/llama.go:124 +0x17 fp=0xc000080c28 sp=0xc000080be0 pc=0x5afe89b127d7 main.(*Server).processBatch(0xc0000cc120, 0xc000080ef8, 0xc000080e90) github.com/ollama/ollama/llama/runner/runner.go:434 +0x285 fp=0xc000080df8 sp=0xc000080c28 pc=0x5afe89c8a9a5 main.(*Server).run(0xc0000cc120, {0x5afe89fc7cc8, 0xc0000a2050}) github.com/ollama/ollama/llama/runner/runner.go:352 +0x359 fp=0xc000080fb8 sp=0xc000080df8 pc=0x5afe89c8a279 main.main.gowrap2() github.com/ollama/ollama/llama/runner/runner.go:907 +0x28 fp=0xc000080fe0 sp=0xc000080fb8 pc=0x5afe89c8e988 runtime.goexit({}) runtime/asm_amd64.s:1695 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x5afe89a7bce1 created by main.main in goroutine 1 github.com/ollama/ollama/llama/runner/runner.go:907 +0xcab goroutine 1 gp=0xc0000061c0 m=nil [IO wait]: runtime.gopark(0x1?, 0xc00002b908?, 0xf4?, 0x9c?, 0xc00002b8e8?) runtime/proc.go:402 +0xce fp=0xc00002b888 sp=0xc00002b868 pc=0x5afe89a49f0e runtime.netpollblock(0x10?, 0x89a12a26?, 0xfe?) runtime/netpoll.go:573 +0xf7 fp=0xc00002b8c0 sp=0xc00002b888 pc=0x5afe89a42157 internal/poll.runtime_pollWait(0x73b93aa68020, 0x72) runtime/netpoll.go:345 +0x85 fp=0xc00002b8e0 sp=0xc00002b8c0 pc=0x5afe89a769a5 internal/poll.(*pollDesc).wait(0x3?, 0x73b93aaa3e88?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00002b908 sp=0xc00002b8e0 pc=0x5afe89ac6c87 internal/poll.(*pollDesc).waitRead(...) 
internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Accept(0xc0000f8080) internal/poll/fd_unix.go:611 +0x2ac fp=0xc00002b9b0 sp=0xc00002b908 pc=0x5afe89ac814c net.(*netFD).accept(0xc0000f8080) net/fd_unix.go:172 +0x29 fp=0xc00002ba68 sp=0xc00002b9b0 pc=0x5afe89b35789 net.(*TCPListener).accept(0xc00007c1c0) net/tcpsock_posix.go:159 +0x1e fp=0xc00002ba90 sp=0xc00002ba68 pc=0x5afe89b464be net.(*TCPListener).Accept(0xc00007c1c0) net/tcpsock.go:327 +0x30 fp=0xc00002bac0 sp=0xc00002ba90 pc=0x5afe89b45810 net/http.(*onceCloseListener).Accept(0xc0001a2000?) <autogenerated>:1 +0x24 fp=0xc00002bad8 sp=0xc00002bac0 pc=0x5afe89c6c924 net/http.(*Server).Serve(0xc0000fe000, {0x5afe89fc76c0, 0xc00007c1c0}) net/http/server.go:3260 +0x33e fp=0xc00002bc08 sp=0xc00002bad8 pc=0x5afe89c6373e main.main() github.com/ollama/ollama/llama/runner/runner.go:927 +0x104c fp=0xc00002bf50 sp=0xc00002bc08 pc=0x5afe89c8e70c runtime.main() runtime/proc.go:271 +0x29d fp=0xc00002bfe0 sp=0xc00002bf50 pc=0x5afe89a49add runtime.goexit({}) runtime/asm_amd64.s:1695 +0x1 fp=0xc00002bfe8 sp=0xc00002bfe0 pc=0x5afe89a7bce1 goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:402 +0xce fp=0xc00006cfa8 sp=0xc00006cf88 pc=0x5afe89a49f0e runtime.goparkunlock(...) runtime/proc.go:408 runtime.forcegchelper() runtime/proc.go:326 +0xb8 fp=0xc00006cfe0 sp=0xc00006cfa8 pc=0x5afe89a49d98 runtime.goexit({}) runtime/asm_amd64.s:1695 +0x1 fp=0xc00006cfe8 sp=0xc00006cfe0 pc=0x5afe89a7bce1 created by runtime.init.6 in goroutine 1 runtime/proc.go:314 +0x1a goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:402 +0xce fp=0xc00006d780 sp=0xc00006d760 pc=0x5afe89a49f0e runtime.goparkunlock(...) runtime/proc.go:408 runtime.bgsweep(0xc000022070) runtime/mgcsweep.go:278 +0x94 fp=0xc00006d7c8 sp=0xc00006d780 pc=0x5afe89a34a54 runtime.gcenable.gowrap1() runtime/mgc.go:203 +0x25 fp=0xc00006d7e0 sp=0xc00006d7c8 pc=0x5afe89a29585 runtime.goexit({}) runtime/asm_amd64.s:1695 +0x1 fp=0xc00006d7e8 sp=0xc00006d7e0 pc=0x5afe89a7bce1 created by runtime.gcenable in goroutine 1 runtime/mgc.go:203 +0x66 goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]: runtime.gopark(0xc000022070?, 0x5afe89ec9be8?, 0x1?, 0x0?, 0xc000007340?) runtime/proc.go:402 +0xce fp=0xc00006df78 sp=0xc00006df58 pc=0x5afe89a49f0e runtime.goparkunlock(...) runtime/proc.go:408 runtime.(*scavengerState).park(0x5afe8a1944c0) runtime/mgcscavenge.go:425 +0x49 fp=0xc00006dfa8 sp=0xc00006df78 pc=0x5afe89a32449 runtime.bgscavenge(0xc000022070) runtime/mgcscavenge.go:653 +0x3c fp=0xc00006dfc8 sp=0xc00006dfa8 pc=0x5afe89a329dc runtime.gcenable.gowrap2() runtime/mgc.go:204 +0x25 fp=0xc00006dfe0 sp=0xc00006dfc8 pc=0x5afe89a29525 runtime.goexit({}) runtime/asm_amd64.s:1695 +0x1 fp=0xc00006dfe8 sp=0xc00006dfe0 pc=0x5afe89a7bce1 created by runtime.gcenable in goroutine 1 runtime/mgc.go:204 +0xa5 goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]: runtime.gopark(0xc00006c648?, 0x5afe89a1ce85?, 0xa8?, 0x1?, 0xc0000061c0?) runtime/proc.go:402 +0xce fp=0xc00006c620 sp=0xc00006c600 pc=0x5afe89a49f0e runtime.runfinq() runtime/mfinal.go:194 +0x107 fp=0xc00006c7e0 sp=0xc00006c620 pc=0x5afe89a285c7 runtime.goexit({}) runtime/asm_amd64.s:1695 +0x1 fp=0xc00006c7e8 sp=0xc00006c7e0 pc=0x5afe89a7bce1 created by runtime.createfing in goroutine 1 runtime/mfinal.go:164 +0x3d goroutine 18 gp=0xc0001a8000 m=nil [chan receive]: runtime.gopark(0x5afe89a79cf4?, 0xc0000ef8c0?, 0x85?, 0xc0?, 0xc0000ef8a8?) 
runtime/proc.go:402 +0xce fp=0xc0000ef888 sp=0xc0000ef868 pc=0x5afe89a49f0e runtime.chanrecv(0xc000098540, 0xc0000efa68, 0x1) runtime/chan.go:583 +0x3bf fp=0xc0000ef900 sp=0xc0000ef888 pc=0x5afe89a158df runtime.chanrecv1(0xc0001b20e0?, 0xc000270008?) runtime/chan.go:442 +0x12 fp=0xc0000ef928 sp=0xc0000ef900 pc=0x5afe89a15512 main.(*Server).embeddings(0xc0000cc120, {0x5afe89fc7870, 0xc0000f68c0}, 0xc000016ea0) github.com/ollama/ollama/llama/runner/runner.go:738 +0x5cb fp=0xc0000efab8 sp=0xc0000ef928 pc=0x5afe89c8cbeb main.(*Server).embeddings-fm({0x5afe89fc7870?, 0xc0000f68c0?}, 0x5afe89c67a6d?) <autogenerated>:1 +0x36 fp=0xc0000efae8 sp=0xc0000efab8 pc=0x5afe89c8eff6 net/http.HandlerFunc.ServeHTTP(0xc000018ea0?, {0x5afe89fc7870?, 0xc0000f68c0?}, 0x10?) net/http/server.go:2171 +0x29 fp=0xc0000efb10 sp=0xc0000efae8 pc=0x5afe89c60509 net/http.(*ServeMux).ServeHTTP(0x5afe89a1ce85?, {0x5afe89fc7870, 0xc0000f68c0}, 0xc000016ea0) net/http/server.go:2688 +0x1ad fp=0xc0000efb60 sp=0xc0000efb10 pc=0x5afe89c6238d net/http.serverHandler.ServeHTTP({0x5afe89fc6bc0?}, {0x5afe89fc7870?, 0xc0000f68c0?}, 0x6?) net/http/server.go:3142 +0x8e fp=0xc0000efb90 sp=0xc0000efb60 pc=0x5afe89c633ae net/http.(*conn).serve(0xc0001a2000, {0x5afe89fc7c90, 0xc0000b0db0}) net/http/server.go:2044 +0x5e8 fp=0xc0000effb8 sp=0xc0000efb90 pc=0x5afe89c5f148 net/http.(*Server).Serve.gowrap3() net/http/server.go:3290 +0x28 fp=0xc0000effe0 sp=0xc0000effb8 pc=0x5afe89c63b28 runtime.goexit({}) runtime/asm_amd64.s:1695 +0x1 fp=0xc0000effe8 sp=0xc0000effe0 pc=0x5afe89a7bce1 created by net/http.(*Server).Serve in goroutine 1 net/http/server.go:3290 +0x4b4 goroutine 12 gp=0xc0000fc700 m=nil [IO wait]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?) runtime/proc.go:402 +0xce fp=0xc00021a5a8 sp=0xc00021a588 pc=0x5afe89a49f0e runtime.netpollblock(0x5afe89ab0818?, 0x89a12a26?, 0xfe?) runtime/netpoll.go:573 +0xf7 fp=0xc00021a5e0 sp=0xc00021a5a8 pc=0x5afe89a42157 internal/poll.runtime_pollWait(0x73b93aa67f28, 0x72) runtime/netpoll.go:345 +0x85 fp=0xc00021a600 sp=0xc00021a5e0 pc=0x5afe89a769a5 internal/poll.(*pollDesc).wait(0xc0001a0000?, 0xc0000b0e21?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00021a628 sp=0xc00021a600 pc=0x5afe89ac6c87 internal/poll.(*pollDesc).waitRead(...) 
internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc0001a0000, {0xc0000b0e21, 0x1, 0x1}) internal/poll/fd_unix.go:164 +0x27a fp=0xc00021a6c0 sp=0xc00021a628 pc=0x5afe89ac77da net.(*netFD).Read(0xc0001a0000, {0xc0000b0e21?, 0x0?, 0x0?}) net/fd_posix.go:55 +0x25 fp=0xc00021a708 sp=0xc00021a6c0 pc=0x5afe89b34685 net.(*conn).Read(0xc000198008, {0xc0000b0e21?, 0x0?, 0x0?}) net/net.go:185 +0x45 fp=0xc00021a750 sp=0xc00021a708 pc=0x5afe89b3e945 net.(*TCPConn).Read(0x0?, {0xc0000b0e21?, 0x0?, 0x0?}) <autogenerated>:1 +0x25 fp=0xc00021a780 sp=0xc00021a750 pc=0x5afe89b4a325 net/http.(*connReader).backgroundRead(0xc0000b0e10) net/http/server.go:681 +0x37 fp=0xc00021a7c8 sp=0xc00021a780 pc=0x5afe89c590b7 net/http.(*connReader).startBackgroundRead.gowrap2() net/http/server.go:677 +0x25 fp=0xc00021a7e0 sp=0xc00021a7c8 pc=0x5afe89c58fe5 runtime.goexit({}) runtime/asm_amd64.s:1695 +0x1 fp=0xc00021a7e8 sp=0xc00021a7e0 pc=0x5afe89a7bce1 created by net/http.(*connReader).startBackgroundRead in goroutine 18 net/http/server.go:677 +0xba rax 0x0 rbx 0x73b8cd600000 rcx 0x73b94186d9fc rdx 0x6 rdi 0x1e rsi 0x20 rbp 0x20 rsp 0x73b8cd5ddcb0 r8 0x73b8cd5ddd80 r9 0x73b8cd5ddd50 r10 0x8 r11 0x246 r12 0x6 r13 0x16 r14 0x0 r15 0x73b8cd5de090 rip 0x73b94186d9fc rflags 0x246 cs 0x33 fs 0x0 gs 0x0 2024-10-31T08:35:32.735Z level=INFO source=routes.go:490 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:43231/embedding\": EOF" [GIN] 2024/10/31 - 08:35:32 | 500 | 1.690382474s | 172.21.0.150 | POST "/api/embeddings" 2024-10-31T08:35:32.735Z level=DEBUG source=sched.go:467 msg="context for request finished" 2024-10-31T08:35:32.735Z level=DEBUG source=sched.go:340 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 duration=5m0s 2024-10-31T08:35:32.735Z level=DEBUG source=sched.go:358 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-6f14709dbba1a24fa7f9c8ffc8e8a5b6c6657555045741912e79fc92907d0ec4 refCount=0 2024-10-31T08:35:32.751Z level=DEBUG source=llama-server.go:395 msg="llama runner terminated" error="exit status 2" ```

@farooquiowais commented on GitHub (Jan 9, 2025):

@PierreMesure I am running ollama version 0.5.4 and I have the same error. Were you able to find a solution?

<!-- gh-comment-id:2580962758 -->

@IHadAFish commented on GitHub (Jan 10, 2025):

@farooquiowais
For me the issue ended up being the input was too large for the model I was using.

<!-- gh-comment-id:2581992248 -->
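
A quick way to sanity-check that hypothesis is to count tokens before sending the request. The sketch below is my own illustration (not from the comment): it uses the Hugging Face tokenizer for the multilingual-e5-large-instruct model from the log above and compares against its 512-token training context (`n_ctx_train = 512`).

```python
# Hypothetical diagnostic: compare the prompt's token count against the
# model's training context (n_ctx_train = 512 in the log above).
from transformers import AutoTokenizer

N_CTX_TRAIN = 512  # taken from the llm_load_print_meta output above

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large-instruct")

def exceeds_context(text: str, limit: int = N_CTX_TRAIN) -> bool:
    """Return True when the text tokenizes to more tokens than the model was trained on."""
    n_tokens = len(tokenizer(text)["input_ids"])
    print(f"{n_tokens} tokens (limit {limit})")
    return n_tokens > limit

exceeds_context("a short test sentence")            # well under the limit
exceeds_context("some long document text " * 500)   # the kind of input that triggers the EOF
```

Inputs over the limit are the ones worth truncating before calling the embeddings endpoint.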

@PierreMesure commented on GitHub (Jan 10, 2025):

Hi, I can't tell you if the problem was solved in the end; we are now using other models and Ollama versions that work without issues. I'll run new tests in the coming weeks.

@IHadAFish, that's interesting to hear! I'll check that as well, and if that's the case, Ollama could benefit from a more explicit error message.

<!-- gh-comment-id:2582021736 -->

@dayeguilaiye commented on GitHub (Feb 10, 2025):

@jmorganca I have the same problem on 0.5.7, so it may not be solved yet. This issue should be reopened.

<!-- gh-comment-id:2647591059 -->

@vietvudanh commented on GitHub (Mar 6, 2025):

Still getting this error on 0.5.13.

<!-- gh-comment-id:2702725457 -->

@PierreMesure commented on GitHub (Mar 6, 2025):

It's definitely also an Ollama problem: the error message should be better, and the failure could be handled more gracefully.

But we got this problem using LlamaIndex and ended up fixing it in their code, specifically in [this class](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/embeddings/llama-index-embeddings-ollama/llama_index/embeddings/ollama/base.py).

```python
import asyncio
from typing import Any, Dict, List, Optional

from llama_index.core.base.embeddings.base import BaseEmbedding
from llama_index.core.bridge.pydantic import Field, PrivateAttr
from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.constants import DEFAULT_EMBED_BATCH_SIZE

from ollama import Client, AsyncClient
from transformers import AutoTokenizer


class OllamaEmbedding(BaseEmbedding):
    """Class for Ollama embeddings."""

    base_url: str = Field(description="Base url the model is hosted by Ollama")
    model_name: str = Field(description="The Ollama model to use.")
    embed_batch_size: int = Field(
        default=DEFAULT_EMBED_BATCH_SIZE,
        description="The batch size for embedding calls.",
        gt=0,
        le=2048,
    )
    ollama_additional_kwargs: Dict[str, Any] = Field(
        default_factory=dict, description="Additional kwargs for the Ollama API."
    )

    _client: Client = PrivateAttr()
    _async_client: AsyncClient = PrivateAttr()
    # Declared as a private attribute so the assignment in __init__ works across pydantic versions.
    _e5_tokenizer: Any = PrivateAttr()

    def __init__(
        self,
        model_name: str,
        base_url: str = "http://localhost:11434",
        embed_batch_size: int = DEFAULT_EMBED_BATCH_SIZE,
        ollama_additional_kwargs: Optional[Dict[str, Any]] = None,
        callback_manager: Optional[CallbackManager] = None,
        **kwargs: Any,
    ) -> None:
        super().__init__(
            model_name=model_name,
            base_url=base_url,
            embed_batch_size=embed_batch_size,
            ollama_additional_kwargs=ollama_additional_kwargs or {},
            callback_manager=callback_manager,
            **kwargs,
        )

        self._client = Client(host=self.base_url)
        self._async_client = AsyncClient(host=self.base_url)
        self._e5_tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large-instruct")

    @classmethod
    def class_name(cls) -> str:
        return "OllamaEmbedding"

    def _get_query_embedding(self, query: str) -> List[float]:
        """Get query embedding."""
        return self.get_general_text_embedding(query)

    async def _aget_query_embedding(self, query: str) -> List[float]:
        """The asynchronous version of _get_query_embedding."""
        return await self.aget_general_text_embedding(query)

    def _get_text_embedding(self, text: str) -> List[float]:
        """Get text embedding."""
        return self.get_general_text_embedding(text)

    async def _aget_text_embedding(self, text: str) -> List[float]:
        """Asynchronously get text embedding."""
        return await self.aget_general_text_embedding(text)

    def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:
        """Get text embeddings."""
        embeddings_list: List[List[float]] = []
        for text in texts:
            embeddings = self.get_general_text_embedding(text)
            embeddings_list.append(embeddings)

        return embeddings_list

    async def _aget_text_embeddings(self, texts: List[str]) -> List[List[float]]:
        """Asynchronously get text embeddings."""
        return await asyncio.gather(
            *[self.aget_general_text_embedding(text) for text in texts]
        )

    def get_general_text_embedding(self, text: str) -> List[float]:
        """Get Ollama embedding."""
        if "e5-large" in self.model_name:
            text = self._truncate_e5_text(text)
        result = self._client.embeddings(
            model=self.model_name, prompt=text, options=self.ollama_additional_kwargs
        )
        return result["embedding"]

    async def aget_general_text_embedding(self, text: str) -> List[float]:
        """Asynchronously get Ollama embedding."""
        if "e5-large" in self.model_name:
            text = self._truncate_e5_text(text)
        result = await self._async_client.embeddings(
            model=self.model_name, prompt=text, options=self.ollama_additional_kwargs
        )
        return result["embedding"]
    
    def _truncate_e5_text(self, text):
        """Clip the input to the tokenizer's max length so it fits the model's context window."""
        truncated_encoding = self._e5_tokenizer(text, max_length=self._e5_tokenizer.model_max_length, truncation=True)
        truncated_text = self._e5_tokenizer.decode(truncated_encoding['input_ids'], skip_special_tokens=True)
        return truncated_text
```
<!-- gh-comment-id:2703006604 -->
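
For completeness, a minimal usage sketch of the patched class above. The model tag is a placeholder (it only needs to contain "e5-large" so the truncation branch is taken), and an Ollama server on the default port is assumed.

```python
# Usage sketch for the patched OllamaEmbedding class above.
# "multilingual-e5-large-instruct" is a placeholder tag; substitute whatever
# tag you actually pulled, as long as it contains "e5-large".
embed_model = OllamaEmbedding(
    model_name="multilingual-e5-large-instruct",
    base_url="http://localhost:11434",
)

# Oversized inputs are truncated client-side before the request, so the
# runner never receives more tokens than the model's context allows.
vector = embed_model.get_general_text_embedding("very long document text " * 500)
print(len(vector))  # 1024 for multilingual-e5-large
```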

@shui-dun commented on GitHub (Mar 22, 2025):

> @farooquiowais For me the issue ended up being the input was too large for the model I was using.

Similarly, I've encountered issues where certain embedding models throw errors when processing excessively long inputs: "Post "http://127.0.0.1:33967/embedding": EOF". This occurs even when the "truncate": true parameter is set. However, other embedding models handle such inputs without any problems.

<!-- gh-comment-id:2744872432 -->
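
For reference, a request of the shape being described might look like the sketch below (my illustration, not taken from the comment): the `/api/embed` endpoint with `truncate` enabled, which reportedly still fails for some models on very long inputs. The model tag and input text are placeholders.

```python
# Sketch of an embedding request with truncation enabled on /api/embed.
# Model tag and input text are placeholders.
import requests

resp = requests.post(
    "http://localhost:11434/api/embed",
    json={
        "model": "granite-embedding:278m",
        "input": "very long text " * 2000,
        "truncate": True,  # ask the server to clip the input to the context window
    },
    timeout=120,
)
resp.raise_for_status()  # with affected models this comes back as a 500 / EOF instead
print(len(resp.json()["embeddings"][0]))
```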

@lizhen789 commented on GitHub (Apr 9, 2025):

Still getting this error on 0.6.5.

<!-- gh-comment-id:2789860592 -->

@kwanLeeFrmVi commented on GitHub (Apr 9, 2025):

Same here on Ollama version 0.6.5, macOS:

- got error message `Post "http://127.0.0.1:55991/embedding": EOF` while the Ollama API runs on port 11435 and the model is `granite-embedding:278m`
- I don't see this error when using `nomic-embed-text`, or when running `granite-embedding:278m` in LM Studio (same model from Hugging Face)
<!-- gh-comment-id:2790891357 -->

@geekypilot commented on GitHub (May 16, 2025):

I just got exactly the same issue while running granite-embedding:278m on version 0.7.0.

ResponseError: do embedding request: Post "http://127.0.0.1:50544/embedding": EOF (status code: 500)

<!-- gh-comment-id:2885427693 -->

@marcjulianschwarz commented on GitHub (May 22, 2025):

> > @farooquiowais For me the issue ended up being the input was too large for the model I was using.
>
> Similarly, I've encountered issues where certain embedding models throw errors when processing excessively long inputs: "Post \"http://127.0.0.1:33967/embedding\": EOF". This occurs even when the "truncate": true parameter is set. However, other embedding models handle such inputs without any problems.

I had the same experience. When I manually truncate the input for models with a relatively small input size, the error no longer appears. The truncate parameter did not help.

Ollama version: 0.7.0

<!-- gh-comment-id:2900181899 -->

@probaku1234 commented on GitHub (Jun 5, 2025):

I had the same issue with granite-embedding:278m on version 0.9.0.

ResponseError: do embedding request: Post "http://127.0.0.1:50544/embedding": EOF (status code: 500)

For me, setting num_ctx to 512 resolved the issue.

<!-- gh-comment-id:2942481189 -->
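
A sketch of that workaround using the official `ollama` Python client (the same client used in the LlamaIndex snippet above); the model tag follows the comment, and `num_ctx` can also be baked into a custom Modelfile with `PARAMETER num_ctx 512`.

```python
# Workaround sketch: cap the context window at the model's real limit.
from ollama import Client

client = Client(host="http://localhost:11434")
result = client.embeddings(
    model="granite-embedding:278m",
    prompt="text to embed",
    options={"num_ctx": 512},  # match the embedding model's true context length
)
print(len(result["embedding"]))
```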

@spaboy commented on GitHub (Jun 7, 2025):

> I had the same issue with granite-embedding:278m on version 0.9.0.
>
> ResponseError: do embedding request: Post "http://127.0.0.1:50544/embedding": EOF (status code: 500)
>
> For me, setting num_ctx to 512 resolved the issue.

This worked for me also on [mxbai-embed-large](https://ollama.com/library/mxbai-embed-large) embeddings.

Thanks!

<!-- gh-comment-id:2952272158 -->

@tiankonguse commented on GitHub (Oct 20, 2025):

> I had the same issue with granite-embedding:278m on version 0.9.0.
>
> ResponseError: do embedding request: Post "http://127.0.0.1:50544/embedding": EOF (status code: 500)
>
> For me, setting num_ctx to 512 resolved the issue.

Thanks, this worked for me also with the bge-m3:567m embedding model.

<!-- gh-comment-id:3421660168 -->

@lukmanottun-hero commented on GitHub (Oct 22, 2025):

I noticed this error occurs when the input size is large. Truncating the input size solves the error for me.

<!-- gh-comment-id:3431037640 -->
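
A crude client-side version of that idea (a sketch of one possible approach, not from the comment): cap the input by a rough character budget derived from the context length, which avoids pulling in a tokenizer at the cost of precision.

```python
# Rough character-budget truncation before calling the embeddings endpoint.
# The 512-token limit and ~4 characters per token are assumptions; tune per model.
from ollama import Client

MAX_TOKENS = 512
CHARS_PER_TOKEN = 4  # rough average for English text

def embed_truncated(client: Client, model: str, text: str) -> list[float]:
    budget = MAX_TOKENS * CHARS_PER_TOKEN
    clipped = text[:budget]
    return client.embeddings(model=model, prompt=clipped, options={"num_ctx": MAX_TOKENS})["embedding"]

client = Client(host="http://localhost:11434")
print(len(embed_truncated(client, "nomic-embed-text", "some long text " * 2000)))
```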

@v-asad commented on GitHub (Dec 15, 2025):

> I had the same issue with granite-embedding:278m on version 0.9.0.
>
> ResponseError: do embedding request: Post "http://127.0.0.1:50544/embedding": EOF (status code: 500)
>
> For me, setting num_ctx to 512 resolved the issue.

Thanks!
This worked for me as well.
I'm using nomic-embed-text.

<!-- gh-comment-id:3655014810 -->

@noobie-bob commented on GitHub (Dec 18, 2025):

Got a similar error with nomic-embed-code. Has anyone used nomic-embed-code successfully with Ollama?

```
ollama run manutic/nomic-embed-code:7b-Q4_K_M "def factorial"
Error: do embedding request: Post "http://127.0.0.1:41255/embedding": EOF
```

Machine: Linux (RHEL 8) in a Docker container, 32 GB RAM, 8-core CPU, latest Ollama 0.13.3.

```
ollama -v
ollama version is 0.13.3
```

Similar to https://github.com/ollama/ollama/issues/8140 and https://github.com/ollama/ollama/issues/12585, judging from the traceback.

Backtrace log

[GIN] 2025/12/18 - 09:57:27 | 200 |      28.651µs |       127.0.0.1 | HEAD     "/"
time=2025-12-18T09:57:27.651Z level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
[GIN] 2025/12/18 - 09:57:27 | 200 |   55.429525ms |       127.0.0.1 | POST     "/api/show"
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
time=2025-12-18T09:57:27.730Z level=DEBUG source=server.go:1291 msg="server unhealthy" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
time=2025-12-18T09:57:27.730Z level=DEBUG source=server.go:1291 msg="server unhealthy" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:161 msg=reloading runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:236 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096 refCount=0
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:247 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:311 msg="runner expired event received" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:326 msg="got lock to unload expired event" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:349 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:674 msg="no need to wait for VRAM recovery" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.770Z level=DEBUG source=server.go:1766 msg="stopping llama server" pid=147
time=2025-12-18T09:57:27.770Z level=DEBUG source=sched.go:358 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
time=2025-12-18T09:57:27.770Z level=DEBUG source=sched.go:361 msg="sending an unloaded event" runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
time=2025-12-18T09:57:27.770Z level=DEBUG source=sched.go:253 msg="unload completed" runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
time=2025-12-18T09:57:27.770Z level=DEBUG source=runner.go:264 msg="refreshing free memory"
time=2025-12-18T09:57:27.770Z level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=8.479µs
time=2025-12-18T09:57:27.771Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2025-12-18T09:57:27.787Z level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
time=2025-12-18T09:57:27.787Z level=DEBUG source=sched.go:211 msg="loading first model" model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
llama_model_loader: loaded meta data with 56 key-value pairs and 338 tensors from /root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = nomic-embed-code-f16-gguf
llama_model_loader: - kv   3:                         general.size_label str              = 7.1B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                   general.base_model.count u32              = 1
llama_model_loader: - kv   6:                  general.base_model.0.name str              = Qwen2.5 Coder 7B Instruct
llama_model_loader: - kv   7:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv   8:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv   9:                      general.dataset.count u32              = 6
llama_model_loader: - kv  10:                     general.dataset.0.name str              = Cornstack Python v1
llama_model_loader: - kv  11:                  general.dataset.0.version str              = v1
llama_model_loader: - kv  12:             general.dataset.0.organization str              = Nomic Ai
llama_model_loader: - kv  13:                 general.dataset.0.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  14:                     general.dataset.1.name str              = Cornstack Javascript v1
llama_model_loader: - kv  15:                  general.dataset.1.version str              = v1
llama_model_loader: - kv  16:             general.dataset.1.organization str              = Nomic Ai
llama_model_loader: - kv  17:                 general.dataset.1.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  18:                     general.dataset.2.name str              = Cornstack Java v1
llama_model_loader: - kv  19:                  general.dataset.2.version str              = v1
llama_model_loader: - kv  20:             general.dataset.2.organization str              = Nomic Ai
llama_model_loader: - kv  21:                 general.dataset.2.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  22:                     general.dataset.3.name str              = Cornstack Go v1
llama_model_loader: - kv  23:                  general.dataset.3.version str              = v1
llama_model_loader: - kv  24:             general.dataset.3.organization str              = Nomic Ai
llama_model_loader: - kv  25:                 general.dataset.3.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  26:                     general.dataset.4.name str              = Cornstack Php v1
llama_model_loader: - kv  27:                  general.dataset.4.version str              = v1
llama_model_loader: - kv  28:             general.dataset.4.organization str              = Nomic Ai
llama_model_loader: - kv  29:                 general.dataset.4.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  30:                     general.dataset.5.name str              = Cornstack Ruby v1
llama_model_loader: - kv  31:                  general.dataset.5.version str              = v1
llama_model_loader: - kv  32:             general.dataset.5.organization str              = Nomic Ai
llama_model_loader: - kv  33:                 general.dataset.5.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  34:                               general.tags arr[str,4]       = ["sentence-transformers", "sentence-s...
llama_model_loader: - kv  35:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  36:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  37:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv  38:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  39:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  40:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  41:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  42:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  43:                         qwen2.pooling_type u32              = 3
llama_model_loader: - kv  44:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  45:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  46:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  47:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  48:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  49:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  50:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  51:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  52:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  53:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  54:               general.quantization_version u32              = 2
llama_model_loader: - kv  55:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.07 GiB (4.95 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 7.07 B
print_info: general.name     = nomic-embed-code-f16-gguf
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-12-18T09:57:28.093Z level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --model /root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 --port 36061"
time=2025-12-18T09:57:28.093Z level=DEBUG source=server.go:393 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_DEBUG=1 LD_LIBRARY_PATH=/usr/lib/ollama:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_LIBRARY_PATH=/usr/lib/ollama
time=2025-12-18T09:57:28.093Z level=INFO source=sched.go:443 msg="system memory" total="31.1 GiB" free="30.9 GiB" free_swap="0 B"
time=2025-12-18T09:57:28.093Z level=INFO source=server.go:459 msg="loading model" "model layers"=29 requested=-1
time=2025-12-18T09:57:28.093Z level=INFO source=server.go:481 msg="embedding model detected, setting batch size to context length" batch_size=4096
time=2025-12-18T09:57:28.093Z level=DEBUG source=ggml.go:279 msg="key with type not found" key=qwen2.attention.key_length default=128
time=2025-12-18T09:57:28.093Z level=DEBUG source=ggml.go:279 msg="key with type not found" key=qwen2.attention.value_length default=128
time=2025-12-18T09:57:28.093Z level=DEBUG source=ggml.go:614 msg="default cache size estimate" "attention MiB"=224 "attention bytes"=234881024 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-12-18T09:57:28.093Z level=DEBUG source=server.go:621 msg=memory estimate.CPU.Weights="[149112832 149112832 149112832 131135488 131135488 149112832 131135488 131135488 149112832 131135488 131135488 149112832 131135488 131135488 149112832 131135488 131135488 149112832 131135488 131135488 149112832 131135488 131135488 149112832 149112832 149112832 149112832 149112832 447082496]" estimate.CPU.Cache="[8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 0]"
time=2025-12-18T09:57:28.093Z level=INFO source=device.go:245 msg="model weights" device=CPU size="4.1 GiB"
time=2025-12-18T09:57:28.093Z level=INFO source=device.go:256 msg="kv cache" device=CPU size="224.0 MiB"
time=2025-12-18T09:57:28.094Z level=INFO source=device.go:272 msg="total memory" size="4.3 GiB"
time=2025-12-18T09:57:28.105Z level=INFO source=runner.go:964 msg="starting go runner"
time=2025-12-18T09:57:28.105Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-skylakex.so
time=2025-12-18T09:57:28.111Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-12-18T09:57:28.111Z level=INFO source=runner.go:1000 msg="Server listening on 127.0.0.1:36061"
time=2025-12-18T09:57:28.115Z level=INFO source=runner.go:894 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:4096 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-12-18T09:57:28.115Z level=INFO source=server.go:1301 msg="waiting for llama runner to start responding"
time=2025-12-18T09:57:28.116Z level=INFO source=server.go:1335 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 56 key-value pairs and 338 tensors from /root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = nomic-embed-code-f16-gguf
llama_model_loader: - kv   3:                         general.size_label str              = 7.1B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                   general.base_model.count u32              = 1
llama_model_loader: - kv   6:                  general.base_model.0.name str              = Qwen2.5 Coder 7B Instruct
llama_model_loader: - kv   7:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv   8:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv   9:                      general.dataset.count u32              = 6
llama_model_loader: - kv  10:                     general.dataset.0.name str              = Cornstack Python v1
llama_model_loader: - kv  11:                  general.dataset.0.version str              = v1
llama_model_loader: - kv  12:             general.dataset.0.organization str              = Nomic Ai
llama_model_loader: - kv  13:                 general.dataset.0.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  14:                     general.dataset.1.name str              = Cornstack Javascript v1
llama_model_loader: - kv  15:                  general.dataset.1.version str              = v1
llama_model_loader: - kv  16:             general.dataset.1.organization str              = Nomic Ai
llama_model_loader: - kv  17:                 general.dataset.1.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  18:                     general.dataset.2.name str              = Cornstack Java v1
llama_model_loader: - kv  19:                  general.dataset.2.version str              = v1
llama_model_loader: - kv  20:             general.dataset.2.organization str              = Nomic Ai
llama_model_loader: - kv  21:                 general.dataset.2.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  22:                     general.dataset.3.name str              = Cornstack Go v1
llama_model_loader: - kv  23:                  general.dataset.3.version str              = v1
llama_model_loader: - kv  24:             general.dataset.3.organization str              = Nomic Ai
llama_model_loader: - kv  25:                 general.dataset.3.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  26:                     general.dataset.4.name str              = Cornstack Php v1
llama_model_loader: - kv  27:                  general.dataset.4.version str              = v1
llama_model_loader: - kv  28:             general.dataset.4.organization str              = Nomic Ai
llama_model_loader: - kv  29:                 general.dataset.4.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  30:                     general.dataset.5.name str              = Cornstack Ruby v1
llama_model_loader: - kv  31:                  general.dataset.5.version str              = v1
llama_model_loader: - kv  32:             general.dataset.5.organization str              = Nomic Ai
llama_model_loader: - kv  33:                 general.dataset.5.repo_url str              = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv  34:                               general.tags arr[str,4]       = ["sentence-transformers", "sentence-s...
llama_model_loader: - kv  35:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  36:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  37:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv  38:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  39:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  40:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  41:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  42:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  43:                         qwen2.pooling_type u32              = 3
llama_model_loader: - kv  44:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  45:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  46:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  47:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  48:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  49:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  50:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  51:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  52:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  53:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  54:               general.quantization_version u32              = 2
llama_model_loader: - kv  55:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.07 GiB (4.95 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 3584
print_info: n_embd_inp       = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 3
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = 7B
print_info: model params     = 7.07 B
print_info: general.name     = nomic-embed-code-f16-gguf
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 0
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 0
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 0
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 0
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 0
load_tensors: layer  25 assigned to device CPU, is_swa = 0
load_tensors: layer  26 assigned to device CPU, is_swa = 0
load_tensors: layer  27 assigned to device CPU, is_swa = 0
load_tensors: layer  28 assigned to device CPU, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_q.bias
create_tensor: loading tensor blk.0.attn_k.bias
create_tensor: loading tensor blk.0.attn_v.bias
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_q.bias
create_tensor: loading tensor blk.1.attn_k.bias
create_tensor: loading tensor blk.1.attn_v.bias
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate.weight
create_tensor: loading tensor blk.1.ffn_down.weight
create_tensor: loading tensor blk.1.ffn_up.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_q.bias
create_tensor: loading tensor blk.2.attn_k.bias
create_tensor: loading tensor blk.2.attn_v.bias
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate.weight
create_tensor: loading tensor blk.2.ffn_down.weight
create_tensor: loading tensor blk.2.ffn_up.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_q.bias
create_tensor: loading tensor blk.3.attn_k.bias
create_tensor: loading tensor blk.3.attn_v.bias
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate.weight
create_tensor: loading tensor blk.3.ffn_down.weight
create_tensor: loading tensor blk.3.ffn_up.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_q.bias
create_tensor: loading tensor blk.4.attn_k.bias
create_tensor: loading tensor blk.4.attn_v.bias
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate.weight
create_tensor: loading tensor blk.4.ffn_down.weight
create_tensor: loading tensor blk.4.ffn_up.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_q.bias
create_tensor: loading tensor blk.5.attn_k.bias
create_tensor: loading tensor blk.5.attn_v.bias
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate.weight
create_tensor: loading tensor blk.5.ffn_down.weight
create_tensor: loading tensor blk.5.ffn_up.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_q.bias
create_tensor: loading tensor blk.6.attn_k.bias
create_tensor: loading tensor blk.6.attn_v.bias
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate.weight
create_tensor: loading tensor blk.6.ffn_down.weight
create_tensor: loading tensor blk.6.ffn_up.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_q.bias
create_tensor: loading tensor blk.7.attn_k.bias
create_tensor: loading tensor blk.7.attn_v.bias
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate.weight
create_tensor: loading tensor blk.7.ffn_down.weight
create_tensor: loading tensor blk.7.ffn_up.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_q.bias
create_tensor: loading tensor blk.8.attn_k.bias
create_tensor: loading tensor blk.8.attn_v.bias
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate.weight
create_tensor: loading tensor blk.8.ffn_down.weight
create_tensor: loading tensor blk.8.ffn_up.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_q.bias
create_tensor: loading tensor blk.9.attn_k.bias
create_tensor: loading tensor blk.9.attn_v.bias
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate.weight
create_tensor: loading tensor blk.9.ffn_down.weight
create_tensor: loading tensor blk.9.ffn_up.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_q.bias
create_tensor: loading tensor blk.10.attn_k.bias
create_tensor: loading tensor blk.10.attn_v.bias
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate.weight
create_tensor: loading tensor blk.10.ffn_down.weight
create_tensor: loading tensor blk.10.ffn_up.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_q.bias
create_tensor: loading tensor blk.11.attn_k.bias
create_tensor: loading tensor blk.11.attn_v.bias
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate.weight
create_tensor: loading tensor blk.11.ffn_down.weight
create_tensor: loading tensor blk.11.ffn_up.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_q.bias
create_tensor: loading tensor blk.12.attn_k.bias
create_tensor: loading tensor blk.12.attn_v.bias
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate.weight
create_tensor: loading tensor blk.12.ffn_down.weight
create_tensor: loading tensor blk.12.ffn_up.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_q.bias
create_tensor: loading tensor blk.13.attn_k.bias
create_tensor: loading tensor blk.13.attn_v.bias
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate.weight
create_tensor: loading tensor blk.13.ffn_down.weight
create_tensor: loading tensor blk.13.ffn_up.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_q.bias
create_tensor: loading tensor blk.14.attn_k.bias
create_tensor: loading tensor blk.14.attn_v.bias
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate.weight
create_tensor: loading tensor blk.14.ffn_down.weight
create_tensor: loading tensor blk.14.ffn_up.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_q.bias
create_tensor: loading tensor blk.15.attn_k.bias
create_tensor: loading tensor blk.15.attn_v.bias
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate.weight
create_tensor: loading tensor blk.15.ffn_down.weight
create_tensor: loading tensor blk.15.ffn_up.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_q.bias
create_tensor: loading tensor blk.16.attn_k.bias
create_tensor: loading tensor blk.16.attn_v.bias
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate.weight
create_tensor: loading tensor blk.16.ffn_down.weight
create_tensor: loading tensor blk.16.ffn_up.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_q.bias
create_tensor: loading tensor blk.17.attn_k.bias
create_tensor: loading tensor blk.17.attn_v.bias
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate.weight
create_tensor: loading tensor blk.17.ffn_down.weight
create_tensor: loading tensor blk.17.ffn_up.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_q.bias
create_tensor: loading tensor blk.18.attn_k.bias
create_tensor: loading tensor blk.18.attn_v.bias
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate.weight
create_tensor: loading tensor blk.18.ffn_down.weight
create_tensor: loading tensor blk.18.ffn_up.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_q.bias
create_tensor: loading tensor blk.19.attn_k.bias
create_tensor: loading tensor blk.19.attn_v.bias
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate.weight
create_tensor: loading tensor blk.19.ffn_down.weight
create_tensor: loading tensor blk.19.ffn_up.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_q.bias
create_tensor: loading tensor blk.20.attn_k.bias
create_tensor: loading tensor blk.20.attn_v.bias
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate.weight
create_tensor: loading tensor blk.20.ffn_down.weight
create_tensor: loading tensor blk.20.ffn_up.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_q.bias
create_tensor: loading tensor blk.21.attn_k.bias
create_tensor: loading tensor blk.21.attn_v.bias
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate.weight
create_tensor: loading tensor blk.21.ffn_down.weight
create_tensor: loading tensor blk.21.ffn_up.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_q.bias
create_tensor: loading tensor blk.22.attn_k.bias
create_tensor: loading tensor blk.22.attn_v.bias
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate.weight
create_tensor: loading tensor blk.22.ffn_down.weight
create_tensor: loading tensor blk.22.ffn_up.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_q.bias
create_tensor: loading tensor blk.23.attn_k.bias
create_tensor: loading tensor blk.23.attn_v.bias
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate.weight
create_tensor: loading tensor blk.23.ffn_down.weight
create_tensor: loading tensor blk.23.ffn_up.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_q.weight
create_tensor: loading tensor blk.24.attn_k.weight
create_tensor: loading tensor blk.24.attn_v.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.attn_q.bias
create_tensor: loading tensor blk.24.attn_k.bias
create_tensor: loading tensor blk.24.attn_v.bias
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate.weight
create_tensor: loading tensor blk.24.ffn_down.weight
create_tensor: loading tensor blk.24.ffn_up.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_q.weight
create_tensor: loading tensor blk.25.attn_k.weight
create_tensor: loading tensor blk.25.attn_v.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.attn_q.bias
create_tensor: loading tensor blk.25.attn_k.bias
create_tensor: loading tensor blk.25.attn_v.bias
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate.weight
create_tensor: loading tensor blk.25.ffn_down.weight
create_tensor: loading tensor blk.25.ffn_up.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_q.weight
create_tensor: loading tensor blk.26.attn_k.weight
create_tensor: loading tensor blk.26.attn_v.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.attn_q.bias
create_tensor: loading tensor blk.26.attn_k.bias
create_tensor: loading tensor blk.26.attn_v.bias
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate.weight
create_tensor: loading tensor blk.26.ffn_down.weight
create_tensor: loading tensor blk.26.ffn_up.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.attn_q.bias
create_tensor: loading tensor blk.27.attn_k.bias
create_tensor: loading tensor blk.27.attn_v.bias
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate.weight
create_tensor: loading tensor blk.27.ffn_down.weight
create_tensor: loading tensor blk.27.ffn_up.weight
load_tensors:          CPU model buffer size =  4168.09 MiB
load_all_data: no device found for buffer type CPU for async uploads
time=2025-12-18T09:57:28.618Z level=DEBUG source=server.go:1345 msg="model load progress 0.13"
time=2025-12-18T09:57:28.869Z level=DEBUG source=server.go:1345 msg="model load progress 0.26"
time=2025-12-18T09:57:29.119Z level=DEBUG source=server.go:1345 msg="model load progress 0.37"
time=2025-12-18T09:57:29.370Z level=DEBUG source=server.go:1345 msg="model load progress 0.49"
time=2025-12-18T09:57:29.621Z level=DEBUG source=server.go:1345 msg="model load progress 0.60"
time=2025-12-18T09:57:29.872Z level=DEBUG source=server.go:1345 msg="model load progress 0.73"
time=2025-12-18T09:57:30.123Z level=DEBUG source=server.go:1345 msg="model load progress 0.85"
time=2025-12-18T09:57:30.373Z level=DEBUG source=server.go:1345 msg="model load progress 0.98"
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 4096
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.59 MiB
llama_kv_cache: layer   0: dev = CPU
llama_kv_cache: layer   1: dev = CPU
llama_kv_cache: layer   2: dev = CPU
llama_kv_cache: layer   3: dev = CPU
llama_kv_cache: layer   4: dev = CPU
llama_kv_cache: layer   5: dev = CPU
llama_kv_cache: layer   6: dev = CPU
llama_kv_cache: layer   7: dev = CPU
llama_kv_cache: layer   8: dev = CPU
llama_kv_cache: layer   9: dev = CPU
llama_kv_cache: layer  10: dev = CPU
llama_kv_cache: layer  11: dev = CPU
llama_kv_cache: layer  12: dev = CPU
llama_kv_cache: layer  13: dev = CPU
llama_kv_cache: layer  14: dev = CPU
llama_kv_cache: layer  15: dev = CPU
llama_kv_cache: layer  16: dev = CPU
llama_kv_cache: layer  17: dev = CPU
llama_kv_cache: layer  18: dev = CPU
llama_kv_cache: layer  19: dev = CPU
llama_kv_cache: layer  20: dev = CPU
llama_kv_cache: layer  21: dev = CPU
llama_kv_cache: layer  22: dev = CPU
llama_kv_cache: layer  23: dev = CPU
llama_kv_cache: layer  24: dev = CPU
llama_kv_cache: layer  25: dev = CPU
llama_kv_cache: layer  26: dev = CPU
llama_kv_cache: layer  27: dev = CPU
llama_kv_cache:        CPU KV buffer size =   224.00 MiB
llama_kv_cache: size =  224.00 MiB (  4096 cells,  28 layers,  1/1 seqs), K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
llama_context: max_nodes = 2704
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 4096, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 4096, n_seqs =  1, n_outputs = 4096
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens = 4096, n_seqs =  1, n_outputs = 4096
llama_context:        CPU compute buffer size =  2512.08 MiB
llama_context: graph nodes  = 1099
llama_context: graph splits = 1
time=2025-12-18T09:57:30.625Z level=INFO source=server.go:1339 msg="llama runner started in 2.53 seconds"
time=2025-12-18T09:57:30.625Z level=INFO source=sched.go:517 msg="loaded runners" count=1
time=2025-12-18T09:57:30.625Z level=INFO source=server.go:1301 msg="waiting for llama runner to start responding"
time=2025-12-18T09:57:30.625Z level=INFO source=server.go:1339 msg="llama runner started in 2.53 seconds"
time=2025-12-18T09:57:30.625Z level=DEBUG source=sched.go:529 msg="finished setting up" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=173 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:30.640Z level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
time=2025-12-18T09:57:30.642Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=2 used=0 remaining=2
//ml/backend/ggml/ggml/src/ggml-cpu/ops.cpp:4748: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/usr/lib/ollama/libggml-base.so.SOVERSION(+0x1a888)[0x7f97ec0b4888]
/usr/lib/ollama/libggml-base.so.SOVERSION(ggml_print_backtrace+0x1e6)[0x7f97ec0b4c56]
/usr/lib/ollama/libggml-base.so.SOVERSION(ggml_abort+0x11d)[0x7f97ec0b4ddd]
/usr/lib/ollama/libggml-cpu-skylakex.so(+0x7103e)[0x7f97e6ef403e]
/usr/lib/ollama/libggml-cpu-skylakex.so(+0x1484c)[0x7f97e6e9784c]
/usr/lib/ollama/libggml-cpu-skylakex.so(ggml_graph_compute+0xdc)[0x7f97e6e99d5c]
/usr/lib/ollama/libggml-cpu-skylakex.so(+0x171b3)[0x7f97e6e9a1b3]
/usr/bin/ollama(+0x11194a0)[0x55fee03344a0]
/usr/bin/ollama(+0x1196f99)[0x55fee03b1f99]
/usr/bin/ollama(+0x11972c2)[0x55fee03b22c2]
/usr/bin/ollama(+0x119d844)[0x55fee03b8844]
/usr/bin/ollama(+0x119e64c)[0x55fee03b964c]
/usr/bin/ollama(+0x10ae1c1)[0x55fee02c91c1]
/usr/bin/ollama(+0x37c1c1)[0x55fedf5971c1]
SIGABRT: abort
PC=0x7f9835fb4b2c m=0 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 5 gp=0xc000306a80 m=0 mp=0x55fee13d7020 [syscall]:
runtime.cgocall(0x55fee02c9180, 0xc000331b88)
        runtime/cgocall.go:167 +0x4b fp=0xc000331b60 sp=0xc000331b28 pc=0x55fedf58c22b
github.com/ollama/ollama/llama._Cfunc_llama_decode(0x7f97ce691500, {0x2, 0x7f97cc183d80, 0x0, 0x7f97cc1c4560, 0x7f97cc1c8570, 0x7f97cc1cc580, 0x7f97cc1db5f0})
        _cgo_gotypes.go:683 +0x4a fp=0xc000331b88 sp=0xc000331b60 pc=0x55fedf94516a
github.com/ollama/ollama/llama.(*Context).Decode.func1(...)
        github.com/ollama/ollama/llama/llama.go:169
github.com/ollama/ollama/llama.(*Context).Decode(0xc000238060?, 0x1?)
        github.com/ollama/ollama/llama/llama.go:169 +0xed fp=0xc000331c70 sp=0xc000331b88 pc=0x55fedf94826d
github.com/ollama/ollama/runner/llamarunner.(*Server).processBatch(0xc0000d1900, 0xc0000f4280, 0xc00030f728)
        github.com/ollama/ollama/runner/llamarunner/runner.go:493 +0x250 fp=0xc000331ee8 sp=0xc000331c70 pc=0x55fedf9ffb30
github.com/ollama/ollama/runner/llamarunner.(*Server).run(0xc0000d1900, {0x55fee0b08c60, 0xc000175e00})
        github.com/ollama/ollama/runner/llamarunner/runner.go:386 +0x1d5 fp=0xc000331fb8 sp=0xc000331ee8 pc=0x55fedf9ff775
github.com/ollama/ollama/runner/llamarunner.Execute.gowrap1()
        github.com/ollama/ollama/runner/llamarunner/runner.go:980 +0x28 fp=0xc000331fe0 sp=0xc000331fb8 pc=0x55fedfa04b48
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000331fe8 sp=0xc000331fe0 pc=0x55fedf597541
created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1
        github.com/ollama/ollama/runner/llamarunner/runner.go:980 +0x4c5

goroutine 1 gp=0xc000002380 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000523790 sp=0xc000523770 pc=0x55fedf58f6ae
runtime.netpollblock(0xc0005237e0?, 0xdf528de6?, 0xfe?)
        runtime/netpoll.go:575 +0xf7 fp=0xc0005237c8 sp=0xc000523790 pc=0x55fedf5549d7
internal/poll.runtime_pollWait(0x7f97ee406de0, 0x72)
        runtime/netpoll.go:351 +0x85 fp=0xc0005237e8 sp=0xc0005237c8 pc=0x55fedf58e8c5
internal/poll.(*pollDesc).wait(0xc0000b3c80?, 0x900000036?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000523810 sp=0xc0005237e8 pc=0x55fedf616a47
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0000b3c80)
        internal/poll/fd_unix.go:620 +0x295 fp=0xc0005238b8 sp=0xc000523810 pc=0x55fedf61be15
net.(*netFD).accept(0xc0000b3c80)
        net/fd_unix.go:172 +0x29 fp=0xc000523970 sp=0xc0005238b8 pc=0x55fedf68ece9
net.(*TCPListener).accept(0xc00004d0c0)
        net/tcpsock_posix.go:159 +0x1b fp=0xc0005239c0 sp=0xc000523970 pc=0x55fedf6a469b
net.(*TCPListener).Accept(0xc00004d0c0)
        net/tcpsock.go:380 +0x30 fp=0xc0005239f0 sp=0xc0005239c0 pc=0x55fedf6a3550
net/http.(*onceCloseListener).Accept(0xc0000d83f0?)
        <autogenerated>:1 +0x24 fp=0xc000523a08 sp=0xc0005239f0 pc=0x55fedf8bad24
net/http.(*Server).Serve(0xc0001f7700, {0x55fee0b06640, 0xc00004d0c0})
        net/http/server.go:3424 +0x30c fp=0xc000523b38 sp=0xc000523a08 pc=0x55fedf8925ec
github.com/ollama/ollama/runner/llamarunner.Execute({0xc000116200, 0x4, 0x4})
        github.com/ollama/ollama/runner/llamarunner/runner.go:1001 +0x8f5 fp=0xc000523d08 sp=0xc000523b38 pc=0x55fedfa048d5
github.com/ollama/ollama/runner.Execute({0xc0001161f0?, 0x0?, 0x0?})
        github.com/ollama/ollama/runner/runner.go:22 +0xd4 fp=0xc000523d30 sp=0xc000523d08 pc=0x55fedfaad974
github.com/ollama/ollama/cmd.NewCLI.func2(0xc0001f7400?, {0x55fee05f70ad?, 0x4?, 0x55fee05f70b1?})
        github.com/ollama/ollama/cmd/cmd.go:1841 +0x45 fp=0xc000523d58 sp=0xc000523d30 pc=0x55fee0259505
github.com/spf13/cobra.(*Command).execute(0xc0000dd508, {0xc00004cf00, 0x4, 0x4})
        github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc000523e78 sp=0xc000523d58 pc=0x55fedf70833c
github.com/spf13/cobra.(*Command).ExecuteC(0xc0000a4908)
        github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc000523f30 sp=0xc000523e78 pc=0x55fedf708b85
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
        github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
        github.com/ollama/ollama/main.go:12 +0x4d fp=0xc000523f50 sp=0xc000523f30 pc=0x55fee0259fed
runtime.main()
        runtime/proc.go:283 +0x29d fp=0xc000523fe0 sp=0xc000523f50 pc=0x55fedf55c05d
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000523fe8 sp=0xc000523fe0 pc=0x55fedf597541

goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006efa8 sp=0xc00006ef88 pc=0x55fedf58f6ae
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.forcegchelper()
        runtime/proc.go:348 +0xb8 fp=0xc00006efe0 sp=0xc00006efa8 pc=0x55fedf55c398
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006efe8 sp=0xc00006efe0 pc=0x55fedf597541
created by runtime.init.7 in goroutine 1
        runtime/proc.go:336 +0x1a

goroutine 18 gp=0xc000102380 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006a780 sp=0xc00006a760 pc=0x55fedf58f6ae
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.bgsweep(0xc000110000)
        runtime/mgcsweep.go:316 +0xdf fp=0xc00006a7c8 sp=0xc00006a780 pc=0x55fedf546b3f
runtime.gcenable.gowrap1()
        runtime/mgc.go:204 +0x25 fp=0xc00006a7e0 sp=0xc00006a7c8 pc=0x55fedf53af25
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006a7e8 sp=0xc00006a7e0 pc=0x55fedf597541
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:204 +0x66

goroutine 19 gp=0xc000102540 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x55fee07c4cf8?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006af78 sp=0xc00006af58 pc=0x55fedf58f6ae
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.(*scavengerState).park(0x55fee13d4200)
        runtime/mgcscavenge.go:425 +0x49 fp=0xc00006afa8 sp=0xc00006af78 pc=0x55fedf544589
runtime.bgscavenge(0xc000110000)
        runtime/mgcscavenge.go:658 +0x59 fp=0xc00006afc8 sp=0xc00006afa8 pc=0x55fedf544b19
runtime.gcenable.gowrap2()
        runtime/mgc.go:205 +0x25 fp=0xc00006afe0 sp=0xc00006afc8 pc=0x55fedf53aec5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006afe8 sp=0xc00006afe0 pc=0x55fedf597541
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:205 +0xa5

goroutine 20 gp=0xc000102a80 m=nil [finalizer wait]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc00006e688?)
        runtime/proc.go:435 +0xce fp=0xc00006e630 sp=0xc00006e610 pc=0x55fedf58f6ae
runtime.runfinq()
        runtime/mfinal.go:196 +0x107 fp=0xc00006e7e0 sp=0xc00006e630 pc=0x55fedf539ee7
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006e7e8 sp=0xc00006e7e0 pc=0x55fedf597541
created by runtime.createfing in goroutine 1
        runtime/mfinal.go:166 +0x3d

goroutine 21 gp=0xc000103500 m=nil [chan receive]:
runtime.gopark(0xc000229900?, 0xc000590018?, 0x60?, 0xb7?, 0x55fedf675928?)
        runtime/proc.go:435 +0xce fp=0xc00006b718 sp=0xc00006b6f8 pc=0x55fedf58f6ae
runtime.chanrecv(0xc000118310, 0x0, 0x1)
        runtime/chan.go:664 +0x445 fp=0xc00006b790 sp=0xc00006b718 pc=0x55fedf52b9c5
runtime.chanrecv1(0x0?, 0x0?)
        runtime/chan.go:506 +0x12 fp=0xc00006b7b8 sp=0xc00006b790 pc=0x55fedf52b552
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
        runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
        runtime/mgc.go:1799 +0x2f fp=0xc00006b7e0 sp=0xc00006b7b8 pc=0x55fedf53e0cf
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006b7e8 sp=0xc00006b7e0 pc=0x55fedf597541
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
        runtime/mgc.go:1794 +0x85

goroutine 22 gp=0xc000103a40 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006bf38 sp=0xc00006bf18 pc=0x55fedf58f6ae
runtime.gcBgMarkWorker(0xc000119730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00006bfc8 sp=0xc00006bf38 pc=0x55fedf53d3e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00006bfe0 sp=0xc00006bfc8 pc=0x55fedf53d2c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006bfe8 sp=0xc00006bfe0 pc=0x55fedf597541
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 23 gp=0xc000103c00 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006c738 sp=0xc00006c718 pc=0x55fedf58f6ae
runtime.gcBgMarkWorker(0xc000119730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00006c7c8 sp=0xc00006c738 pc=0x55fedf53d3e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00006c7e0 sp=0xc00006c7c8 pc=0x55fedf53d2c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006c7e8 sp=0xc00006c7e0 pc=0x55fedf597541
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 34 gp=0xc000306000 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00030c738 sp=0xc00030c718 pc=0x55fedf58f6ae
runtime.gcBgMarkWorker(0xc000119730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00030c7c8 sp=0xc00030c738 pc=0x55fedf53d3e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00030c7e0 sp=0xc00030c7c8 pc=0x55fedf53d2c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00030c7e8 sp=0xc00030c7e0 pc=0x55fedf597541
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 35 gp=0xc0003061c0 m=nil [GC worker (idle)]:
runtime.gopark(0xd24871050930d?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00030cf38 sp=0xc00030cf18 pc=0x55fedf58f6ae
runtime.gcBgMarkWorker(0xc000119730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00030cfc8 sp=0xc00030cf38 pc=0x55fedf53d3e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00030cfe0 sp=0xc00030cfc8 pc=0x55fedf53d2c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00030cfe8 sp=0xc00030cfe0 pc=0x55fedf597541
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 3 gp=0xc0000036c0 m=nil [GC worker (idle)]:
runtime.gopark(0xd248710515e7f?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006f738 sp=0xc00006f718 pc=0x55fedf58f6ae
runtime.gcBgMarkWorker(0xc000119730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00006f7c8 sp=0xc00006f738 pc=0x55fedf53d3e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00006f7e0 sp=0xc00006f7c8 pc=0x55fedf53d2c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006f7e8 sp=0xc00006f7e0 pc=0x55fedf597541
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 24 gp=0xc000103dc0 m=nil [GC worker (idle)]:
runtime.gopark(0xd2487104e9bfa?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006cf38 sp=0xc00006cf18 pc=0x55fedf58f6ae
runtime.gcBgMarkWorker(0xc000119730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00006cfc8 sp=0xc00006cf38 pc=0x55fedf53d3e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00006cfe0 sp=0xc00006cfc8 pc=0x55fedf53d2c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006cfe8 sp=0xc00006cfe0 pc=0x55fedf597541
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 36 gp=0xc000306380 m=nil [GC worker (idle)]:
runtime.gopark(0xd24871051b2ed?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00030d738 sp=0xc00030d718 pc=0x55fedf58f6ae
runtime.gcBgMarkWorker(0xc000119730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00030d7c8 sp=0xc00030d738 pc=0x55fedf53d3e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00030d7e0 sp=0xc00030d7c8 pc=0x55fedf53d2c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00030d7e8 sp=0xc00030d7e0 pc=0x55fedf597541
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 4 gp=0xc000003880 m=nil [GC worker (idle)]:
runtime.gopark(0xd248710518979?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006ff38 sp=0xc00006ff18 pc=0x55fedf58f6ae
runtime.gcBgMarkWorker(0xc000119730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00006ffc8 sp=0xc00006ff38 pc=0x55fedf53d3e9
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x55fedf53d2c5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x55fedf597541
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 6 gp=0xc000306c40 m=nil [chan receive]:
runtime.gopark(0x55fedf595554?, 0xc0000458d0?, 0x70?, 0xf3?, 0xc0000458b8?)
        runtime/proc.go:435 +0xce fp=0xc000045898 sp=0xc000045878 pc=0x55fedf58f6ae
runtime.chanrecv(0xc000426070, 0xc000045a70, 0x1)
        runtime/chan.go:664 +0x445 fp=0xc000045910 sp=0xc000045898 pc=0x55fedf52b9c5
runtime.chanrecv1(0xc0000d6f60?, 0xc000115c00?)
        runtime/chan.go:506 +0x12 fp=0xc000045938 sp=0xc000045910 pc=0x55fedf52b552
github.com/ollama/ollama/runner/llamarunner.(*Server).embeddings(0xc0000d1900, {0x55fee0b06820, 0xc00016c7e0}, 0xc00001d2c0)
        github.com/ollama/ollama/runner/llamarunner/runner.go:806 +0x72d fp=0xc000045ac0 sp=0xc000045938 pc=0x55fedfa0268d
github.com/ollama/ollama/runner/llamarunner.(*Server).embeddings-fm({0x55fee0b06820?, 0xc00016c7e0?}, 0xc000045b40?)
        <autogenerated>:1 +0x36 fp=0xc000045af0 sp=0xc000045ac0 pc=0x55fedfa04ed6
net/http.HandlerFunc.ServeHTTP(0xc00052d5c0?, {0x55fee0b06820?, 0xc00016c7e0?}, 0xc000045b60?)
        net/http/server.go:2294 +0x29 fp=0xc000045b18 sp=0xc000045af0 pc=0x55fedf88ec29
net/http.(*ServeMux).ServeHTTP(0x55fedf534405?, {0x55fee0b06820, 0xc00016c7e0}, 0xc00001d2c0)
        net/http/server.go:2822 +0x1c4 fp=0xc000045b68 sp=0xc000045b18 pc=0x55fedf890b24
net/http.serverHandler.ServeHTTP({0x55fee0b02e10?}, {0x55fee0b06820?, 0xc00016c7e0?}, 0x1?)
        net/http/server.go:3301 +0x8e fp=0xc000045b98 sp=0xc000045b68 pc=0x55fedf8ae5ae
net/http.(*conn).serve(0xc0000d83f0, {0x55fee0b08c28, 0xc0000d6960})
        net/http/server.go:2102 +0x625 fp=0xc000045fb8 sp=0xc000045b98 pc=0x55fedf88d125
net/http.(*Server).Serve.gowrap3()
        net/http/server.go:3454 +0x28 fp=0xc000045fe0 sp=0xc000045fb8 pc=0x55fedf8929e8
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000045fe8 sp=0xc000045fe0 pc=0x55fedf597541
created by net/http.(*Server).Serve in goroutine 1
        net/http/server.go:3454 +0x485

goroutine 47 gp=0xc000306e00 m=nil [IO wait]:
runtime.gopark(0x55fee1368940?, 0xc00030fe38?, 0x38?, 0xfe?, 0xb?)
        runtime/proc.go:435 +0xce fp=0xc00030fdd8 sp=0xc00030fdb8 pc=0x55fedf58f6ae
runtime.netpollblock(0x55fedf5b2e78?, 0xdf528de6?, 0xfe?)
        runtime/netpoll.go:575 +0xf7 fp=0xc00030fe10 sp=0xc00030fdd8 pc=0x55fedf5549d7
internal/poll.runtime_pollWait(0x7f97ee406cc8, 0x72)
        runtime/netpoll.go:351 +0x85 fp=0xc00030fe30 sp=0xc00030fe10 pc=0x55fedf58e8c5
internal/poll.(*pollDesc).wait(0xc0000b3d00?, 0xc0000d6a61?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00030fe58 sp=0xc00030fe30 pc=0x55fedf616a47
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0000b3d00, {0xc0000d6a61, 0x1, 0x1})
        internal/poll/fd_unix.go:165 +0x27a fp=0xc00030fef0 sp=0xc00030fe58 pc=0x55fedf617d3a
net.(*netFD).Read(0xc0000b3d00, {0xc0000d6a61?, 0xc00004d198?, 0xc00030ff70?})
        net/fd_posix.go:55 +0x25 fp=0xc00030ff38 sp=0xc00030fef0 pc=0x55fedf68cd45
net.(*conn).Read(0xc000124968, {0xc0000d6a61?, 0x0?, 0x0?})
        net/net.go:194 +0x45 fp=0xc00030ff80 sp=0xc00030ff38 pc=0x55fedf69b105
net/http.(*connReader).backgroundRead(0xc0000d6a50)
        net/http/server.go:690 +0x37 fp=0xc00030ffc8 sp=0xc00030ff80 pc=0x55fedf886ff7
net/http.(*connReader).startBackgroundRead.gowrap2()
        net/http/server.go:686 +0x25 fp=0xc00030ffe0 sp=0xc00030ffc8 pc=0x55fedf886f25
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00030ffe8 sp=0xc00030ffe0 pc=0x55fedf597541
created by net/http.(*connReader).startBackgroundRead in goroutine 6
        net/http/server.go:686 +0xb6

rax    0x0
rbx    0xad
rcx    0x7f9835fb4b2c
rdx    0x6
rdi    0xad
rsi    0xad
rbp    0x7ffe92e34c90
rsp    0x7ffe92e34c50
r8     0x0
r9     0x7
r10    0x8
r11    0x246
r12    0x6
r13    0x7f97e6f63260
r14    0x16
r15    0x7f961a3c6040
rip    0x7f9835fb4b2c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
[GIN] 2025/12/18 - 09:57:31 | 400 |  3.428637064s |       127.0.0.1 | POST     "/api/embed"
time=2025-12-18T09:57:31.081Z level=ERROR source=server.go:265 msg="llama runner terminated" error="exit status 2"
time=2025-12-18T09:57:31.081Z level=DEBUG source=sched.go:537 msg="context for request finished"
time=2025-12-18T09:57:31.081Z level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=173 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096 duration=5m0s
time=2025-12-18T09:57:31.081Z level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=173 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096 refCount=0

The same error occurs even with a reduced num_ctx:

curl http://localhost:11434/api/embed -d '{
  "model": "manutic/nomic-embed-code:7b-Q4_K_M",
  "input": ["Represent this query for searching relevant code: def factorial(n): return 1 if n == 0 else n * factorial(n-1)"],
  "options": {"num_ctx": 256}
}'
{"error":"do embedding request: Post \"http://127.0.0.1:41215/embedding\": EOF"}
<!-- gh-comment-id:3669465633 -->
@noobie-bob commented on GitHub (Dec 18, 2025):

Got a similar error with nomic-embed-code. Has anyone used nomic-embed-code successfully with Ollama?

```bash
ollama run manutic/nomic-embed-code:7b-Q4_K_M "def factorial"
Error: do embedding request: Post "http://127.0.0.1:41255/embedding": EOF
```

Machine: Linux, Docker container on RHEL 8, 32 GB RAM, 8-core CPU, latest Ollama 0.13.3.

```
ollama -v
ollama version is 0.13.3
```

Similar to https://github.com/ollama/ollama/issues/8140 and https://github.com/ollama/ollama/issues/12585, judging from the traceback.

Backtrace log:

```log
[GIN] 2025/12/18 - 09:57:27 | 200 | 28.651µs | 127.0.0.1 | HEAD "/"
time=2025-12-18T09:57:27.651Z level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
[GIN] 2025/12/18 - 09:57:27 | 200 | 55.429525ms | 127.0.0.1 | POST "/api/show"
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
time=2025-12-18T09:57:27.730Z level=DEBUG source=server.go:1291 msg="server unhealthy" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
time=2025-12-18T09:57:27.730Z level=DEBUG source=server.go:1291 msg="server unhealthy" error="llama runner process no longer running: 2 GGML_ASSERT(i01 >= 0 && i01 < ne01) failed"
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:161 msg=reloading runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:236 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096 refCount=0
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:247 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:311 msg="runner expired event received" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:326 msg="got lock to unload expired event" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:349 msg="starting background wait for VRAM recovery" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.730Z level=DEBUG source=sched.go:674 msg="no need to wait for VRAM recovery" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096
time=2025-12-18T09:57:27.770Z level=DEBUG source=server.go:1766 msg="stopping llama server" pid=147
time=2025-12-18T09:57:27.770Z level=DEBUG source=sched.go:358 msg="runner terminated and removed from list, blocking for VRAM recovery" runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
time=2025-12-18T09:57:27.770Z level=DEBUG source=sched.go:361 msg="sending an unloaded event" runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
time=2025-12-18T09:57:27.770Z level=DEBUG source=sched.go:253 msg="unload completed" runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=147 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
time=2025-12-18T09:57:27.770Z level=DEBUG source=runner.go:264 msg="refreshing free memory"
time=2025-12-18T09:57:27.770Z level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=8.479µs
time=2025-12-18T09:57:27.771Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2025-12-18T09:57:27.787Z level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
time=2025-12-18T09:57:27.787Z level=DEBUG source=sched.go:211 msg="loading first model" model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7
llama_model_loader: loaded meta data with 56 key-value pairs and 338 tensors from /root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = nomic-embed-code-f16-gguf
llama_model_loader: - kv 3: general.size_label str = 7.1B
llama_model_loader: - kv 4: general.license str = apache-2.0
llama_model_loader: - kv 5: general.base_model.count u32 = 1
llama_model_loader: - kv 6: general.base_model.0.name str = Qwen2.5 Coder 7B Instruct
llama_model_loader: - kv 7: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 8: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv 9: general.dataset.count u32 = 6
llama_model_loader: - kv 10: general.dataset.0.name str = Cornstack Python v1
llama_model_loader: - kv 11: general.dataset.0.version str = v1
llama_model_loader: - kv 12: general.dataset.0.organization str = Nomic Ai
llama_model_loader: - kv 13: general.dataset.0.repo_url str = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv 14: general.dataset.1.name str = Cornstack Javascript v1
llama_model_loader: - kv 15: general.dataset.1.version str = v1
llama_model_loader: - kv 16: general.dataset.1.organization str = Nomic Ai
llama_model_loader: - kv 17: general.dataset.1.repo_url str = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv 18: general.dataset.2.name str = Cornstack Java v1
llama_model_loader: - kv 19: general.dataset.2.version str = v1
llama_model_loader: - kv 20: general.dataset.2.organization str = Nomic Ai
llama_model_loader: - kv 21: general.dataset.2.repo_url str = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv 22: general.dataset.3.name str = Cornstack Go v1
llama_model_loader: - kv 23: general.dataset.3.version str = v1
llama_model_loader: - kv 24: general.dataset.3.organization str = Nomic Ai
llama_model_loader: - kv 25: general.dataset.3.repo_url str = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv 26: general.dataset.4.name str = Cornstack Php v1
llama_model_loader: - kv 27: general.dataset.4.version str = v1
llama_model_loader: - kv 28: general.dataset.4.organization str = Nomic Ai
llama_model_loader: - kv 29: general.dataset.4.repo_url str = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv 30: general.dataset.5.name str = Cornstack Ruby v1
llama_model_loader: - kv 31: general.dataset.5.version str = v1
llama_model_loader: - kv 32: general.dataset.5.organization str = Nomic Ai
llama_model_loader: - kv 33: general.dataset.5.repo_url str = https://huggingface.co/nomic-ai/corns...
llama_model_loader: - kv 34: general.tags arr[str,4] = ["sentence-transformers", "sentence-s...
llama_model_loader: - kv 35: qwen2.block_count u32 = 28
llama_model_loader: - kv 36: qwen2.context_length u32 = 32768
llama_model_loader: - kv 37: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 38: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 39: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 40: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 41: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 42: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 43: qwen2.pooling_type u32 = 3
llama_model_loader: - kv 44: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 45: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 46: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 47: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 48: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 49: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 50: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 51: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 52: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 53: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 54: general.quantization_version u32 = 2
llama_model_loader: - kv 55: general.file_type u32 = 15
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.07 GiB (4.95 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 7.07 B
print_info: general.name = nomic-embed-code-f16-gguf
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-12-18T09:57:28.093Z level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --model /root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 --port 36061"
time=2025-12-18T09:57:28.093Z level=DEBUG source=server.go:393 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_DEBUG=1 LD_LIBRARY_PATH=/usr/lib/ollama:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_LIBRARY_PATH=/usr/lib/ollama
time=2025-12-18T09:57:28.093Z level=INFO source=sched.go:443 msg="system memory" total="31.1 GiB" free="30.9 GiB" free_swap="0 B"
time=2025-12-18T09:57:28.093Z level=INFO source=server.go:459 msg="loading model" "model layers"=29 requested=-1
time=2025-12-18T09:57:28.093Z level=INFO source=server.go:481 msg="embedding model detected, setting batch size to context length" batch_size=4096 time=2025-12-18T09:57:28.093Z level=DEBUG source=ggml.go:279 msg="key with type not found" key=qwen2.attention.key_length default=128 time=2025-12-18T09:57:28.093Z level=DEBUG source=ggml.go:279 msg="key with type not found" key=qwen2.attention.value_length default=128 time=2025-12-18T09:57:28.093Z level=DEBUG source=ggml.go:614 msg="default cache size estimate" "attention MiB"=224 "attention bytes"=234881024 "recurrent MiB"=0 "recurrent bytes"=0 time=2025-12-18T09:57:28.093Z level=DEBUG source=server.go:621 msg=memory estimate.CPU.Weights="[149112832 149112832 149112832 131135488 131135488 149112832 131135488 131135488 149112832 131135488 131135488 149112832 131135488 131135488 149112832 131135488 131135488 149112832 131135488 131135488 149112832 131135488 131135488 149112832 149112832 149112832 149112832 149112832 447082496]" estimate.CPU.Cache="[8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 8388608 0]" time=2025-12-18T09:57:28.093Z level=INFO source=device.go:245 msg="model weights" device=CPU size="4.1 GiB" time=2025-12-18T09:57:28.093Z level=INFO source=device.go:256 msg="kv cache" device=CPU size="224.0 MiB" time=2025-12-18T09:57:28.094Z level=INFO source=device.go:272 msg="total memory" size="4.3 GiB" time=2025-12-18T09:57:28.105Z level=INFO source=runner.go:964 msg="starting go runner" time=2025-12-18T09:57:28.105Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-skylakex.so time=2025-12-18T09:57:28.111Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc) time=2025-12-18T09:57:28.111Z level=INFO source=runner.go:1000 msg="Server listening on 127.0.0.1:36061" time=2025-12-18T09:57:28.115Z level=INFO source=runner.go:894 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:4096 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2025-12-18T09:57:28.115Z level=INFO source=server.go:1301 msg="waiting for llama runner to start responding" time=2025-12-18T09:57:28.116Z level=INFO source=server.go:1335 msg="waiting for server to become available" status="llm server loading model" llama_model_loader: loaded meta data with 56 key-value pairs and 338 tensors from /root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = nomic-embed-code-f16-gguf llama_model_loader: - kv 3: general.size_label str = 7.1B llama_model_loader: - kv 4: general.license str = apache-2.0 llama_model_loader: - kv 5: general.base_model.count u32 = 1 llama_model_loader: - kv 6: general.base_model.0.name str = Qwen2.5 Coder 7B Instruct llama_model_loader: - kv 7: general.base_model.0.organization str = Qwen llama_model_loader: - kv 8: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-C... llama_model_loader: - kv 9: general.dataset.count u32 = 6 llama_model_loader: - kv 10: general.dataset.0.name str = Cornstack Python v1 llama_model_loader: - kv 11: general.dataset.0.version str = v1 llama_model_loader: - kv 12: general.dataset.0.organization str = Nomic Ai llama_model_loader: - kv 13: general.dataset.0.repo_url str = https://huggingface.co/nomic-ai/corns... llama_model_loader: - kv 14: general.dataset.1.name str = Cornstack Javascript v1 llama_model_loader: - kv 15: general.dataset.1.version str = v1 llama_model_loader: - kv 16: general.dataset.1.organization str = Nomic Ai llama_model_loader: - kv 17: general.dataset.1.repo_url str = https://huggingface.co/nomic-ai/corns... llama_model_loader: - kv 18: general.dataset.2.name str = Cornstack Java v1 llama_model_loader: - kv 19: general.dataset.2.version str = v1 llama_model_loader: - kv 20: general.dataset.2.organization str = Nomic Ai llama_model_loader: - kv 21: general.dataset.2.repo_url str = https://huggingface.co/nomic-ai/corns... llama_model_loader: - kv 22: general.dataset.3.name str = Cornstack Go v1 llama_model_loader: - kv 23: general.dataset.3.version str = v1 llama_model_loader: - kv 24: general.dataset.3.organization str = Nomic Ai llama_model_loader: - kv 25: general.dataset.3.repo_url str = https://huggingface.co/nomic-ai/corns... llama_model_loader: - kv 26: general.dataset.4.name str = Cornstack Php v1 llama_model_loader: - kv 27: general.dataset.4.version str = v1 llama_model_loader: - kv 28: general.dataset.4.organization str = Nomic Ai llama_model_loader: - kv 29: general.dataset.4.repo_url str = https://huggingface.co/nomic-ai/corns... llama_model_loader: - kv 30: general.dataset.5.name str = Cornstack Ruby v1 llama_model_loader: - kv 31: general.dataset.5.version str = v1 llama_model_loader: - kv 32: general.dataset.5.organization str = Nomic Ai llama_model_loader: - kv 33: general.dataset.5.repo_url str = https://huggingface.co/nomic-ai/corns... llama_model_loader: - kv 34: general.tags arr[str,4] = ["sentence-transformers", "sentence-s... llama_model_loader: - kv 35: qwen2.block_count u32 = 28 llama_model_loader: - kv 36: qwen2.context_length u32 = 32768 llama_model_loader: - kv 37: qwen2.embedding_length u32 = 3584 llama_model_loader: - kv 38: qwen2.feed_forward_length u32 = 18944 llama_model_loader: - kv 39: qwen2.attention.head_count u32 = 28 llama_model_loader: - kv 40: qwen2.attention.head_count_kv u32 = 4 llama_model_loader: - kv 41: qwen2.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 42: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 43: qwen2.pooling_type u32 = 3 llama_model_loader: - kv 44: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 45: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 46: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... 
llama_model_loader: - kv 47: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 48: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 49: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 50: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 51: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 52: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 53: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... llama_model_loader: - kv 54: general.quantization_version u32 = 2 llama_model_loader: - kv 55: general.file_type u32 = 15 llama_model_loader: - type f32: 141 tensors llama_model_loader: - type q4_K: 168 tensors llama_model_loader: - type q6_K: 29 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 4.07 GiB (4.95 BPW) init_tokenizer: initializing tokenizer for type 2 load: control token: 151660 '<|fim_middle|>' is not marked as EOG load: control token: 151659 '<|fim_prefix|>' is not marked as EOG load: control token: 151653 '<|vision_end|>' is not marked as EOG load: control token: 151648 '<|box_start|>' is not marked as EOG load: control token: 151646 '<|object_ref_start|>' is not marked as EOG load: control token: 151649 '<|box_end|>' is not marked as EOG load: control token: 151655 '<|image_pad|>' is not marked as EOG load: control token: 151651 '<|quad_end|>' is not marked as EOG load: control token: 151647 '<|object_ref_end|>' is not marked as EOG load: control token: 151652 '<|vision_start|>' is not marked as EOG load: control token: 151654 '<|vision_pad|>' is not marked as EOG load: control token: 151656 '<|video_pad|>' is not marked as EOG load: control token: 151644 '<|im_start|>' is not marked as EOG load: control token: 151661 '<|fim_suffix|>' is not marked as EOG load: control token: 151650 '<|quad_start|>' is not marked as EOG load: printing all EOG tokens: load: - 151643 ('<|endoftext|>') load: - 151645 ('<|im_end|>') load: - 151662 ('<|fim_pad|>') load: - 151663 ('<|repo_name|>') load: - 151664 ('<|file_sep|>') load: special tokens cache size = 22 load: token to piece cache size = 0.9310 MB print_info: arch = qwen2 print_info: vocab_only = 0 print_info: n_ctx_train = 32768 print_info: n_embd = 3584 print_info: n_embd_inp = 3584 print_info: n_layer = 28 print_info: n_head = 28 print_info: n_head_kv = 4 print_info: n_rot = 128 print_info: n_swa = 0 print_info: is_swa_any = 0 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 7 print_info: n_embd_k_gqa = 512 print_info: n_embd_v_gqa = 512 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-06 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 18944 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: n_expert_groups = 0 print_info: n_group_used = 0 print_info: causal attn = 1 print_info: pooling type = 3 print_info: rope type = 2 print_info: rope scaling = linear print_info: freq_base_train = 1000000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 32768 print_info: rope_finetuned = unknown print_info: model type = 7B print_info: model params = 7.07 B print_info: general.name = nomic-embed-code-f16-gguf print_info: vocab type = BPE print_info: n_vocab = 152064 print_info: n_merges = 151387 
print_info: BOS token = 151643 '<|endoftext|>' print_info: EOS token = 151645 '<|im_end|>' print_info: EOT token = 151645 '<|im_end|>' print_info: PAD token = 151643 '<|endoftext|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 '<|endoftext|>' print_info: EOG token = 151645 '<|im_end|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token = 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = false) load_tensors: layer 0 assigned to device CPU, is_swa = 0 load_tensors: layer 1 assigned to device CPU, is_swa = 0 load_tensors: layer 2 assigned to device CPU, is_swa = 0 load_tensors: layer 3 assigned to device CPU, is_swa = 0 load_tensors: layer 4 assigned to device CPU, is_swa = 0 load_tensors: layer 5 assigned to device CPU, is_swa = 0 load_tensors: layer 6 assigned to device CPU, is_swa = 0 load_tensors: layer 7 assigned to device CPU, is_swa = 0 load_tensors: layer 8 assigned to device CPU, is_swa = 0 load_tensors: layer 9 assigned to device CPU, is_swa = 0 load_tensors: layer 10 assigned to device CPU, is_swa = 0 load_tensors: layer 11 assigned to device CPU, is_swa = 0 load_tensors: layer 12 assigned to device CPU, is_swa = 0 load_tensors: layer 13 assigned to device CPU, is_swa = 0 load_tensors: layer 14 assigned to device CPU, is_swa = 0 load_tensors: layer 15 assigned to device CPU, is_swa = 0 load_tensors: layer 16 assigned to device CPU, is_swa = 0 load_tensors: layer 17 assigned to device CPU, is_swa = 0 load_tensors: layer 18 assigned to device CPU, is_swa = 0 load_tensors: layer 19 assigned to device CPU, is_swa = 0 load_tensors: layer 20 assigned to device CPU, is_swa = 0 load_tensors: layer 21 assigned to device CPU, is_swa = 0 load_tensors: layer 22 assigned to device CPU, is_swa = 0 load_tensors: layer 23 assigned to device CPU, is_swa = 0 load_tensors: layer 24 assigned to device CPU, is_swa = 0 load_tensors: layer 25 assigned to device CPU, is_swa = 0 load_tensors: layer 26 assigned to device CPU, is_swa = 0 load_tensors: layer 27 assigned to device CPU, is_swa = 0 load_tensors: layer 28 assigned to device CPU, is_swa = 0 create_tensor: loading tensor token_embd.weight create_tensor: loading tensor output_norm.weight create_tensor: loading tensor blk.0.attn_norm.weight create_tensor: loading tensor blk.0.attn_q.weight create_tensor: loading tensor blk.0.attn_k.weight create_tensor: loading tensor blk.0.attn_v.weight create_tensor: loading tensor blk.0.attn_output.weight create_tensor: loading tensor blk.0.attn_q.bias create_tensor: loading tensor blk.0.attn_k.bias create_tensor: loading tensor blk.0.attn_v.bias create_tensor: loading tensor blk.0.ffn_norm.weight create_tensor: loading tensor blk.0.ffn_gate.weight create_tensor: loading tensor blk.0.ffn_down.weight create_tensor: loading tensor blk.0.ffn_up.weight create_tensor: loading tensor blk.1.attn_norm.weight create_tensor: loading tensor blk.1.attn_q.weight create_tensor: loading tensor blk.1.attn_k.weight create_tensor: loading tensor blk.1.attn_v.weight create_tensor: loading tensor blk.1.attn_output.weight create_tensor: loading tensor blk.1.attn_q.bias 
create_tensor: loading tensor blk.1.attn_k.bias create_tensor: loading tensor blk.1.attn_v.bias create_tensor: loading tensor blk.1.ffn_norm.weight create_tensor: loading tensor blk.1.ffn_gate.weight create_tensor: loading tensor blk.1.ffn_down.weight create_tensor: loading tensor blk.1.ffn_up.weight create_tensor: loading tensor blk.2.attn_norm.weight create_tensor: loading tensor blk.2.attn_q.weight create_tensor: loading tensor blk.2.attn_k.weight create_tensor: loading tensor blk.2.attn_v.weight create_tensor: loading tensor blk.2.attn_output.weight create_tensor: loading tensor blk.2.attn_q.bias create_tensor: loading tensor blk.2.attn_k.bias create_tensor: loading tensor blk.2.attn_v.bias create_tensor: loading tensor blk.2.ffn_norm.weight create_tensor: loading tensor blk.2.ffn_gate.weight create_tensor: loading tensor blk.2.ffn_down.weight create_tensor: loading tensor blk.2.ffn_up.weight create_tensor: loading tensor blk.3.attn_norm.weight create_tensor: loading tensor blk.3.attn_q.weight create_tensor: loading tensor blk.3.attn_k.weight create_tensor: loading tensor blk.3.attn_v.weight create_tensor: loading tensor blk.3.attn_output.weight create_tensor: loading tensor blk.3.attn_q.bias create_tensor: loading tensor blk.3.attn_k.bias create_tensor: loading tensor blk.3.attn_v.bias create_tensor: loading tensor blk.3.ffn_norm.weight create_tensor: loading tensor blk.3.ffn_gate.weight create_tensor: loading tensor blk.3.ffn_down.weight create_tensor: loading tensor blk.3.ffn_up.weight create_tensor: loading tensor blk.4.attn_norm.weight create_tensor: loading tensor blk.4.attn_q.weight create_tensor: loading tensor blk.4.attn_k.weight create_tensor: loading tensor blk.4.attn_v.weight create_tensor: loading tensor blk.4.attn_output.weight create_tensor: loading tensor blk.4.attn_q.bias create_tensor: loading tensor blk.4.attn_k.bias create_tensor: loading tensor blk.4.attn_v.bias create_tensor: loading tensor blk.4.ffn_norm.weight create_tensor: loading tensor blk.4.ffn_gate.weight create_tensor: loading tensor blk.4.ffn_down.weight create_tensor: loading tensor blk.4.ffn_up.weight create_tensor: loading tensor blk.5.attn_norm.weight create_tensor: loading tensor blk.5.attn_q.weight create_tensor: loading tensor blk.5.attn_k.weight create_tensor: loading tensor blk.5.attn_v.weight create_tensor: loading tensor blk.5.attn_output.weight create_tensor: loading tensor blk.5.attn_q.bias create_tensor: loading tensor blk.5.attn_k.bias create_tensor: loading tensor blk.5.attn_v.bias create_tensor: loading tensor blk.5.ffn_norm.weight create_tensor: loading tensor blk.5.ffn_gate.weight create_tensor: loading tensor blk.5.ffn_down.weight create_tensor: loading tensor blk.5.ffn_up.weight create_tensor: loading tensor blk.6.attn_norm.weight create_tensor: loading tensor blk.6.attn_q.weight create_tensor: loading tensor blk.6.attn_k.weight create_tensor: loading tensor blk.6.attn_v.weight create_tensor: loading tensor blk.6.attn_output.weight create_tensor: loading tensor blk.6.attn_q.bias create_tensor: loading tensor blk.6.attn_k.bias create_tensor: loading tensor blk.6.attn_v.bias create_tensor: loading tensor blk.6.ffn_norm.weight create_tensor: loading tensor blk.6.ffn_gate.weight create_tensor: loading tensor blk.6.ffn_down.weight create_tensor: loading tensor blk.6.ffn_up.weight create_tensor: loading tensor blk.7.attn_norm.weight create_tensor: loading tensor blk.7.attn_q.weight create_tensor: loading tensor blk.7.attn_k.weight create_tensor: loading tensor blk.7.attn_v.weight 
create_tensor: loading tensor blk.7.attn_output.weight create_tensor: loading tensor blk.7.attn_q.bias create_tensor: loading tensor blk.7.attn_k.bias create_tensor: loading tensor blk.7.attn_v.bias create_tensor: loading tensor blk.7.ffn_norm.weight create_tensor: loading tensor blk.7.ffn_gate.weight create_tensor: loading tensor blk.7.ffn_down.weight create_tensor: loading tensor blk.7.ffn_up.weight create_tensor: loading tensor blk.8.attn_norm.weight create_tensor: loading tensor blk.8.attn_q.weight create_tensor: loading tensor blk.8.attn_k.weight create_tensor: loading tensor blk.8.attn_v.weight create_tensor: loading tensor blk.8.attn_output.weight create_tensor: loading tensor blk.8.attn_q.bias create_tensor: loading tensor blk.8.attn_k.bias create_tensor: loading tensor blk.8.attn_v.bias create_tensor: loading tensor blk.8.ffn_norm.weight create_tensor: loading tensor blk.8.ffn_gate.weight create_tensor: loading tensor blk.8.ffn_down.weight create_tensor: loading tensor blk.8.ffn_up.weight create_tensor: loading tensor blk.9.attn_norm.weight create_tensor: loading tensor blk.9.attn_q.weight create_tensor: loading tensor blk.9.attn_k.weight create_tensor: loading tensor blk.9.attn_v.weight create_tensor: loading tensor blk.9.attn_output.weight create_tensor: loading tensor blk.9.attn_q.bias create_tensor: loading tensor blk.9.attn_k.bias create_tensor: loading tensor blk.9.attn_v.bias create_tensor: loading tensor blk.9.ffn_norm.weight create_tensor: loading tensor blk.9.ffn_gate.weight create_tensor: loading tensor blk.9.ffn_down.weight create_tensor: loading tensor blk.9.ffn_up.weight create_tensor: loading tensor blk.10.attn_norm.weight create_tensor: loading tensor blk.10.attn_q.weight create_tensor: loading tensor blk.10.attn_k.weight create_tensor: loading tensor blk.10.attn_v.weight create_tensor: loading tensor blk.10.attn_output.weight create_tensor: loading tensor blk.10.attn_q.bias create_tensor: loading tensor blk.10.attn_k.bias create_tensor: loading tensor blk.10.attn_v.bias create_tensor: loading tensor blk.10.ffn_norm.weight create_tensor: loading tensor blk.10.ffn_gate.weight create_tensor: loading tensor blk.10.ffn_down.weight create_tensor: loading tensor blk.10.ffn_up.weight create_tensor: loading tensor blk.11.attn_norm.weight create_tensor: loading tensor blk.11.attn_q.weight create_tensor: loading tensor blk.11.attn_k.weight create_tensor: loading tensor blk.11.attn_v.weight create_tensor: loading tensor blk.11.attn_output.weight create_tensor: loading tensor blk.11.attn_q.bias create_tensor: loading tensor blk.11.attn_k.bias create_tensor: loading tensor blk.11.attn_v.bias create_tensor: loading tensor blk.11.ffn_norm.weight create_tensor: loading tensor blk.11.ffn_gate.weight create_tensor: loading tensor blk.11.ffn_down.weight create_tensor: loading tensor blk.11.ffn_up.weight create_tensor: loading tensor blk.12.attn_norm.weight create_tensor: loading tensor blk.12.attn_q.weight create_tensor: loading tensor blk.12.attn_k.weight create_tensor: loading tensor blk.12.attn_v.weight create_tensor: loading tensor blk.12.attn_output.weight create_tensor: loading tensor blk.12.attn_q.bias create_tensor: loading tensor blk.12.attn_k.bias create_tensor: loading tensor blk.12.attn_v.bias create_tensor: loading tensor blk.12.ffn_norm.weight create_tensor: loading tensor blk.12.ffn_gate.weight create_tensor: loading tensor blk.12.ffn_down.weight create_tensor: loading tensor blk.12.ffn_up.weight create_tensor: loading tensor blk.13.attn_norm.weight create_tensor: 
loading tensor blk.13.attn_q.weight create_tensor: loading tensor blk.13.attn_k.weight create_tensor: loading tensor blk.13.attn_v.weight create_tensor: loading tensor blk.13.attn_output.weight create_tensor: loading tensor blk.13.attn_q.bias create_tensor: loading tensor blk.13.attn_k.bias create_tensor: loading tensor blk.13.attn_v.bias create_tensor: loading tensor blk.13.ffn_norm.weight create_tensor: loading tensor blk.13.ffn_gate.weight create_tensor: loading tensor blk.13.ffn_down.weight create_tensor: loading tensor blk.13.ffn_up.weight create_tensor: loading tensor blk.14.attn_norm.weight create_tensor: loading tensor blk.14.attn_q.weight create_tensor: loading tensor blk.14.attn_k.weight create_tensor: loading tensor blk.14.attn_v.weight create_tensor: loading tensor blk.14.attn_output.weight create_tensor: loading tensor blk.14.attn_q.bias create_tensor: loading tensor blk.14.attn_k.bias create_tensor: loading tensor blk.14.attn_v.bias create_tensor: loading tensor blk.14.ffn_norm.weight create_tensor: loading tensor blk.14.ffn_gate.weight create_tensor: loading tensor blk.14.ffn_down.weight create_tensor: loading tensor blk.14.ffn_up.weight create_tensor: loading tensor blk.15.attn_norm.weight create_tensor: loading tensor blk.15.attn_q.weight create_tensor: loading tensor blk.15.attn_k.weight create_tensor: loading tensor blk.15.attn_v.weight create_tensor: loading tensor blk.15.attn_output.weight create_tensor: loading tensor blk.15.attn_q.bias create_tensor: loading tensor blk.15.attn_k.bias create_tensor: loading tensor blk.15.attn_v.bias create_tensor: loading tensor blk.15.ffn_norm.weight create_tensor: loading tensor blk.15.ffn_gate.weight create_tensor: loading tensor blk.15.ffn_down.weight create_tensor: loading tensor blk.15.ffn_up.weight create_tensor: loading tensor blk.16.attn_norm.weight create_tensor: loading tensor blk.16.attn_q.weight create_tensor: loading tensor blk.16.attn_k.weight create_tensor: loading tensor blk.16.attn_v.weight create_tensor: loading tensor blk.16.attn_output.weight create_tensor: loading tensor blk.16.attn_q.bias create_tensor: loading tensor blk.16.attn_k.bias create_tensor: loading tensor blk.16.attn_v.bias create_tensor: loading tensor blk.16.ffn_norm.weight create_tensor: loading tensor blk.16.ffn_gate.weight create_tensor: loading tensor blk.16.ffn_down.weight create_tensor: loading tensor blk.16.ffn_up.weight create_tensor: loading tensor blk.17.attn_norm.weight create_tensor: loading tensor blk.17.attn_q.weight create_tensor: loading tensor blk.17.attn_k.weight create_tensor: loading tensor blk.17.attn_v.weight create_tensor: loading tensor blk.17.attn_output.weight create_tensor: loading tensor blk.17.attn_q.bias create_tensor: loading tensor blk.17.attn_k.bias create_tensor: loading tensor blk.17.attn_v.bias create_tensor: loading tensor blk.17.ffn_norm.weight create_tensor: loading tensor blk.17.ffn_gate.weight create_tensor: loading tensor blk.17.ffn_down.weight create_tensor: loading tensor blk.17.ffn_up.weight create_tensor: loading tensor blk.18.attn_norm.weight create_tensor: loading tensor blk.18.attn_q.weight create_tensor: loading tensor blk.18.attn_k.weight create_tensor: loading tensor blk.18.attn_v.weight create_tensor: loading tensor blk.18.attn_output.weight create_tensor: loading tensor blk.18.attn_q.bias create_tensor: loading tensor blk.18.attn_k.bias create_tensor: loading tensor blk.18.attn_v.bias create_tensor: loading tensor blk.18.ffn_norm.weight create_tensor: loading tensor blk.18.ffn_gate.weight 
create_tensor: loading tensor blk.18.ffn_down.weight create_tensor: loading tensor blk.18.ffn_up.weight create_tensor: loading tensor blk.19.attn_norm.weight create_tensor: loading tensor blk.19.attn_q.weight create_tensor: loading tensor blk.19.attn_k.weight create_tensor: loading tensor blk.19.attn_v.weight create_tensor: loading tensor blk.19.attn_output.weight create_tensor: loading tensor blk.19.attn_q.bias create_tensor: loading tensor blk.19.attn_k.bias create_tensor: loading tensor blk.19.attn_v.bias create_tensor: loading tensor blk.19.ffn_norm.weight create_tensor: loading tensor blk.19.ffn_gate.weight create_tensor: loading tensor blk.19.ffn_down.weight create_tensor: loading tensor blk.19.ffn_up.weight create_tensor: loading tensor blk.20.attn_norm.weight create_tensor: loading tensor blk.20.attn_q.weight create_tensor: loading tensor blk.20.attn_k.weight create_tensor: loading tensor blk.20.attn_v.weight create_tensor: loading tensor blk.20.attn_output.weight create_tensor: loading tensor blk.20.attn_q.bias create_tensor: loading tensor blk.20.attn_k.bias create_tensor: loading tensor blk.20.attn_v.bias create_tensor: loading tensor blk.20.ffn_norm.weight create_tensor: loading tensor blk.20.ffn_gate.weight create_tensor: loading tensor blk.20.ffn_down.weight create_tensor: loading tensor blk.20.ffn_up.weight create_tensor: loading tensor blk.21.attn_norm.weight create_tensor: loading tensor blk.21.attn_q.weight create_tensor: loading tensor blk.21.attn_k.weight create_tensor: loading tensor blk.21.attn_v.weight create_tensor: loading tensor blk.21.attn_output.weight create_tensor: loading tensor blk.21.attn_q.bias create_tensor: loading tensor blk.21.attn_k.bias create_tensor: loading tensor blk.21.attn_v.bias create_tensor: loading tensor blk.21.ffn_norm.weight create_tensor: loading tensor blk.21.ffn_gate.weight create_tensor: loading tensor blk.21.ffn_down.weight create_tensor: loading tensor blk.21.ffn_up.weight create_tensor: loading tensor blk.22.attn_norm.weight create_tensor: loading tensor blk.22.attn_q.weight create_tensor: loading tensor blk.22.attn_k.weight create_tensor: loading tensor blk.22.attn_v.weight create_tensor: loading tensor blk.22.attn_output.weight create_tensor: loading tensor blk.22.attn_q.bias create_tensor: loading tensor blk.22.attn_k.bias create_tensor: loading tensor blk.22.attn_v.bias create_tensor: loading tensor blk.22.ffn_norm.weight create_tensor: loading tensor blk.22.ffn_gate.weight create_tensor: loading tensor blk.22.ffn_down.weight create_tensor: loading tensor blk.22.ffn_up.weight create_tensor: loading tensor blk.23.attn_norm.weight create_tensor: loading tensor blk.23.attn_q.weight create_tensor: loading tensor blk.23.attn_k.weight create_tensor: loading tensor blk.23.attn_v.weight create_tensor: loading tensor blk.23.attn_output.weight create_tensor: loading tensor blk.23.attn_q.bias create_tensor: loading tensor blk.23.attn_k.bias create_tensor: loading tensor blk.23.attn_v.bias create_tensor: loading tensor blk.23.ffn_norm.weight create_tensor: loading tensor blk.23.ffn_gate.weight create_tensor: loading tensor blk.23.ffn_down.weight create_tensor: loading tensor blk.23.ffn_up.weight create_tensor: loading tensor blk.24.attn_norm.weight create_tensor: loading tensor blk.24.attn_q.weight create_tensor: loading tensor blk.24.attn_k.weight create_tensor: loading tensor blk.24.attn_v.weight create_tensor: loading tensor blk.24.attn_output.weight create_tensor: loading tensor blk.24.attn_q.bias create_tensor: loading tensor 
blk.24.attn_k.bias create_tensor: loading tensor blk.24.attn_v.bias create_tensor: loading tensor blk.24.ffn_norm.weight create_tensor: loading tensor blk.24.ffn_gate.weight create_tensor: loading tensor blk.24.ffn_down.weight create_tensor: loading tensor blk.24.ffn_up.weight create_tensor: loading tensor blk.25.attn_norm.weight create_tensor: loading tensor blk.25.attn_q.weight create_tensor: loading tensor blk.25.attn_k.weight create_tensor: loading tensor blk.25.attn_v.weight create_tensor: loading tensor blk.25.attn_output.weight create_tensor: loading tensor blk.25.attn_q.bias create_tensor: loading tensor blk.25.attn_k.bias create_tensor: loading tensor blk.25.attn_v.bias create_tensor: loading tensor blk.25.ffn_norm.weight create_tensor: loading tensor blk.25.ffn_gate.weight create_tensor: loading tensor blk.25.ffn_down.weight create_tensor: loading tensor blk.25.ffn_up.weight create_tensor: loading tensor blk.26.attn_norm.weight create_tensor: loading tensor blk.26.attn_q.weight create_tensor: loading tensor blk.26.attn_k.weight create_tensor: loading tensor blk.26.attn_v.weight create_tensor: loading tensor blk.26.attn_output.weight create_tensor: loading tensor blk.26.attn_q.bias create_tensor: loading tensor blk.26.attn_k.bias create_tensor: loading tensor blk.26.attn_v.bias create_tensor: loading tensor blk.26.ffn_norm.weight create_tensor: loading tensor blk.26.ffn_gate.weight create_tensor: loading tensor blk.26.ffn_down.weight create_tensor: loading tensor blk.26.ffn_up.weight create_tensor: loading tensor blk.27.attn_norm.weight create_tensor: loading tensor blk.27.attn_q.weight create_tensor: loading tensor blk.27.attn_k.weight create_tensor: loading tensor blk.27.attn_v.weight create_tensor: loading tensor blk.27.attn_output.weight create_tensor: loading tensor blk.27.attn_q.bias create_tensor: loading tensor blk.27.attn_k.bias create_tensor: loading tensor blk.27.attn_v.bias create_tensor: loading tensor blk.27.ffn_norm.weight create_tensor: loading tensor blk.27.ffn_gate.weight create_tensor: loading tensor blk.27.ffn_down.weight create_tensor: loading tensor blk.27.ffn_up.weight load_tensors: CPU model buffer size = 4168.09 MiB load_all_data: no device found for buffer type CPU for async uploads time=2025-12-18T09:57:28.618Z level=DEBUG source=server.go:1345 msg="model load progress 0.13" time=2025-12-18T09:57:28.869Z level=DEBUG source=server.go:1345 msg="model load progress 0.26" time=2025-12-18T09:57:29.119Z level=DEBUG source=server.go:1345 msg="model load progress 0.37" time=2025-12-18T09:57:29.370Z level=DEBUG source=server.go:1345 msg="model load progress 0.49" time=2025-12-18T09:57:29.621Z level=DEBUG source=server.go:1345 msg="model load progress 0.60" time=2025-12-18T09:57:29.872Z level=DEBUG source=server.go:1345 msg="model load progress 0.73" time=2025-12-18T09:57:30.123Z level=DEBUG source=server.go:1345 msg="model load progress 0.85" time=2025-12-18T09:57:30.373Z level=DEBUG source=server.go:1345 msg="model load progress 0.98" llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 4096 llama_context: n_ctx_seq = 4096 llama_context: n_batch = 4096 llama_context: n_ubatch = 4096 llama_context: causal_attn = 1 llama_context: flash_attn = disabled llama_context: kv_unified = false llama_context: freq_base = 1000000.0 llama_context: freq_scale = 1 llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized set_abort_callback: call llama_context: CPU output buffer size = 
0.59 MiB llama_kv_cache: layer 0: dev = CPU llama_kv_cache: layer 1: dev = CPU llama_kv_cache: layer 2: dev = CPU llama_kv_cache: layer 3: dev = CPU llama_kv_cache: layer 4: dev = CPU llama_kv_cache: layer 5: dev = CPU llama_kv_cache: layer 6: dev = CPU llama_kv_cache: layer 7: dev = CPU llama_kv_cache: layer 8: dev = CPU llama_kv_cache: layer 9: dev = CPU llama_kv_cache: layer 10: dev = CPU llama_kv_cache: layer 11: dev = CPU llama_kv_cache: layer 12: dev = CPU llama_kv_cache: layer 13: dev = CPU llama_kv_cache: layer 14: dev = CPU llama_kv_cache: layer 15: dev = CPU llama_kv_cache: layer 16: dev = CPU llama_kv_cache: layer 17: dev = CPU llama_kv_cache: layer 18: dev = CPU llama_kv_cache: layer 19: dev = CPU llama_kv_cache: layer 20: dev = CPU llama_kv_cache: layer 21: dev = CPU llama_kv_cache: layer 22: dev = CPU llama_kv_cache: layer 23: dev = CPU llama_kv_cache: layer 24: dev = CPU llama_kv_cache: layer 25: dev = CPU llama_kv_cache: layer 26: dev = CPU llama_kv_cache: layer 27: dev = CPU llama_kv_cache: CPU KV buffer size = 224.00 MiB llama_kv_cache: size = 224.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 112.00 MiB, V (f16): 112.00 MiB llama_context: enumerating backends llama_context: backend_ptrs.size() = 1 llama_context: max_nodes = 2704 llama_context: reserving full memory module llama_context: worst-case: n_tokens = 4096, n_seqs = 1, n_outputs = 1 graph_reserve: reserving a graph for ubatch with n_tokens = 4096, n_seqs = 1, n_outputs = 4096 graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1 graph_reserve: reserving a graph for ubatch with n_tokens = 4096, n_seqs = 1, n_outputs = 4096 llama_context: CPU compute buffer size = 2512.08 MiB llama_context: graph nodes = 1099 llama_context: graph splits = 1 time=2025-12-18T09:57:30.625Z level=INFO source=server.go:1339 msg="llama runner started in 2.53 seconds" time=2025-12-18T09:57:30.625Z level=INFO source=sched.go:517 msg="loaded runners" count=1 time=2025-12-18T09:57:30.625Z level=INFO source=server.go:1301 msg="waiting for llama runner to start responding" time=2025-12-18T09:57:30.625Z level=INFO source=server.go:1339 msg="llama runner started in 2.53 seconds" time=2025-12-18T09:57:30.625Z level=DEBUG source=sched.go:529 msg="finished setting up" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=173 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096 time=2025-12-18T09:57:30.640Z level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32 time=2025-12-18T09:57:30.642Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=2 used=0 remaining=2 //ml/backend/ggml/ggml/src/ggml-cpu/ops.cpp:4748: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed /usr/lib/ollama/libggml-base.so.SOVERSION(+0x1a888)[0x7f97ec0b4888] /usr/lib/ollama/libggml-base.so.SOVERSION(ggml_print_backtrace+0x1e6)[0x7f97ec0b4c56] /usr/lib/ollama/libggml-base.so.SOVERSION(ggml_abort+0x11d)[0x7f97ec0b4ddd] /usr/lib/ollama/libggml-cpu-skylakex.so(+0x7103e)[0x7f97e6ef403e] /usr/lib/ollama/libggml-cpu-skylakex.so(+0x1484c)[0x7f97e6e9784c] /usr/lib/ollama/libggml-cpu-skylakex.so(ggml_graph_compute+0xdc)[0x7f97e6e99d5c] /usr/lib/ollama/libggml-cpu-skylakex.so(+0x171b3)[0x7f97e6e9a1b3] /usr/bin/ollama(+0x11194a0)[0x55fee03344a0] /usr/bin/ollama(+0x1196f99)[0x55fee03b1f99] /usr/bin/ollama(+0x11972c2)[0x55fee03b22c2] 
/usr/bin/ollama(+0x119d844)[0x55fee03b8844] /usr/bin/ollama(+0x119e64c)[0x55fee03b964c] /usr/bin/ollama(+0x10ae1c1)[0x55fee02c91c1] /usr/bin/ollama(+0x37c1c1)[0x55fedf5971c1] SIGABRT: abort PC=0x7f9835fb4b2c m=0 sigcode=18446744073709551610 signal arrived during cgo execution goroutine 5 gp=0xc000306a80 m=0 mp=0x55fee13d7020 [syscall]: runtime.cgocall(0x55fee02c9180, 0xc000331b88) runtime/cgocall.go:167 +0x4b fp=0xc000331b60 sp=0xc000331b28 pc=0x55fedf58c22b github.com/ollama/ollama/llama._Cfunc_llama_decode(0x7f97ce691500, {0x2, 0x7f97cc183d80, 0x0, 0x7f97cc1c4560, 0x7f97cc1c8570, 0x7f97cc1cc580, 0x7f97cc1db5f0}) _cgo_gotypes.go:683 +0x4a fp=0xc000331b88 sp=0xc000331b60 pc=0x55fedf94516a github.com/ollama/ollama/llama.(*Context).Decode.func1(...) github.com/ollama/ollama/llama/llama.go:169 github.com/ollama/ollama/llama.(*Context).Decode(0xc000238060?, 0x1?) github.com/ollama/ollama/llama/llama.go:169 +0xed fp=0xc000331c70 sp=0xc000331b88 pc=0x55fedf94826d github.com/ollama/ollama/runner/llamarunner.(*Server).processBatch(0xc0000d1900, 0xc0000f4280, 0xc00030f728) github.com/ollama/ollama/runner/llamarunner/runner.go:493 +0x250 fp=0xc000331ee8 sp=0xc000331c70 pc=0x55fedf9ffb30 github.com/ollama/ollama/runner/llamarunner.(*Server).run(0xc0000d1900, {0x55fee0b08c60, 0xc000175e00}) github.com/ollama/ollama/runner/llamarunner/runner.go:386 +0x1d5 fp=0xc000331fb8 sp=0xc000331ee8 pc=0x55fedf9ff775 github.com/ollama/ollama/runner/llamarunner.Execute.gowrap1() github.com/ollama/ollama/runner/llamarunner/runner.go:980 +0x28 fp=0xc000331fe0 sp=0xc000331fb8 pc=0x55fedfa04b48 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000331fe8 sp=0xc000331fe0 pc=0x55fedf597541 created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1 github.com/ollama/ollama/runner/llamarunner/runner.go:980 +0x4c5 goroutine 1 gp=0xc000002380 m=nil [IO wait]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000523790 sp=0xc000523770 pc=0x55fedf58f6ae runtime.netpollblock(0xc0005237e0?, 0xdf528de6?, 0xfe?) runtime/netpoll.go:575 +0xf7 fp=0xc0005237c8 sp=0xc000523790 pc=0x55fedf5549d7 internal/poll.runtime_pollWait(0x7f97ee406de0, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc0005237e8 sp=0xc0005237c8 pc=0x55fedf58e8c5 internal/poll.(*pollDesc).wait(0xc0000b3c80?, 0x900000036?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000523810 sp=0xc0005237e8 pc=0x55fedf616a47 internal/poll.(*pollDesc).waitRead(...) internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Accept(0xc0000b3c80) internal/poll/fd_unix.go:620 +0x295 fp=0xc0005238b8 sp=0xc000523810 pc=0x55fedf61be15 net.(*netFD).accept(0xc0000b3c80) net/fd_unix.go:172 +0x29 fp=0xc000523970 sp=0xc0005238b8 pc=0x55fedf68ece9 net.(*TCPListener).accept(0xc00004d0c0) net/tcpsock_posix.go:159 +0x1b fp=0xc0005239c0 sp=0xc000523970 pc=0x55fedf6a469b net.(*TCPListener).Accept(0xc00004d0c0) net/tcpsock.go:380 +0x30 fp=0xc0005239f0 sp=0xc0005239c0 pc=0x55fedf6a3550 net/http.(*onceCloseListener).Accept(0xc0000d83f0?) 
<autogenerated>:1 +0x24 fp=0xc000523a08 sp=0xc0005239f0 pc=0x55fedf8bad24 net/http.(*Server).Serve(0xc0001f7700, {0x55fee0b06640, 0xc00004d0c0}) net/http/server.go:3424 +0x30c fp=0xc000523b38 sp=0xc000523a08 pc=0x55fedf8925ec github.com/ollama/ollama/runner/llamarunner.Execute({0xc000116200, 0x4, 0x4}) github.com/ollama/ollama/runner/llamarunner/runner.go:1001 +0x8f5 fp=0xc000523d08 sp=0xc000523b38 pc=0x55fedfa048d5 github.com/ollama/ollama/runner.Execute({0xc0001161f0?, 0x0?, 0x0?}) github.com/ollama/ollama/runner/runner.go:22 +0xd4 fp=0xc000523d30 sp=0xc000523d08 pc=0x55fedfaad974 github.com/ollama/ollama/cmd.NewCLI.func2(0xc0001f7400?, {0x55fee05f70ad?, 0x4?, 0x55fee05f70b1?}) github.com/ollama/ollama/cmd/cmd.go:1841 +0x45 fp=0xc000523d58 sp=0xc000523d30 pc=0x55fee0259505 github.com/spf13/cobra.(*Command).execute(0xc0000dd508, {0xc00004cf00, 0x4, 0x4}) github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc000523e78 sp=0xc000523d58 pc=0x55fedf70833c github.com/spf13/cobra.(*Command).ExecuteC(0xc0000a4908) github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc000523f30 sp=0xc000523e78 pc=0x55fedf708b85 github.com/spf13/cobra.(*Command).Execute(...) github.com/spf13/cobra@v1.7.0/command.go:992 github.com/spf13/cobra.(*Command).ExecuteContext(...) github.com/spf13/cobra@v1.7.0/command.go:985 main.main() github.com/ollama/ollama/main.go:12 +0x4d fp=0xc000523f50 sp=0xc000523f30 pc=0x55fee0259fed runtime.main() runtime/proc.go:283 +0x29d fp=0xc000523fe0 sp=0xc000523f50 pc=0x55fedf55c05d runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000523fe8 sp=0xc000523fe0 pc=0x55fedf597541 goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006efa8 sp=0xc00006ef88 pc=0x55fedf58f6ae runtime.goparkunlock(...) runtime/proc.go:441 runtime.forcegchelper() runtime/proc.go:348 +0xb8 fp=0xc00006efe0 sp=0xc00006efa8 pc=0x55fedf55c398 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006efe8 sp=0xc00006efe0 pc=0x55fedf597541 created by runtime.init.7 in goroutine 1 runtime/proc.go:336 +0x1a goroutine 18 gp=0xc000102380 m=nil [GC sweep wait]: runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006a780 sp=0xc00006a760 pc=0x55fedf58f6ae runtime.goparkunlock(...) runtime/proc.go:441 runtime.bgsweep(0xc000110000) runtime/mgcsweep.go:316 +0xdf fp=0xc00006a7c8 sp=0xc00006a780 pc=0x55fedf546b3f runtime.gcenable.gowrap1() runtime/mgc.go:204 +0x25 fp=0xc00006a7e0 sp=0xc00006a7c8 pc=0x55fedf53af25 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006a7e8 sp=0xc00006a7e0 pc=0x55fedf597541 created by runtime.gcenable in goroutine 1 runtime/mgc.go:204 +0x66 goroutine 19 gp=0xc000102540 m=nil [GC scavenge wait]: runtime.gopark(0x10000?, 0x55fee07c4cf8?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006af78 sp=0xc00006af58 pc=0x55fedf58f6ae runtime.goparkunlock(...) 
runtime/proc.go:441 runtime.(*scavengerState).park(0x55fee13d4200) runtime/mgcscavenge.go:425 +0x49 fp=0xc00006afa8 sp=0xc00006af78 pc=0x55fedf544589 runtime.bgscavenge(0xc000110000) runtime/mgcscavenge.go:658 +0x59 fp=0xc00006afc8 sp=0xc00006afa8 pc=0x55fedf544b19 runtime.gcenable.gowrap2() runtime/mgc.go:205 +0x25 fp=0xc00006afe0 sp=0xc00006afc8 pc=0x55fedf53aec5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006afe8 sp=0xc00006afe0 pc=0x55fedf597541 created by runtime.gcenable in goroutine 1 runtime/mgc.go:205 +0xa5 goroutine 20 gp=0xc000102a80 m=nil [finalizer wait]: runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc00006e688?) runtime/proc.go:435 +0xce fp=0xc00006e630 sp=0xc00006e610 pc=0x55fedf58f6ae runtime.runfinq() runtime/mfinal.go:196 +0x107 fp=0xc00006e7e0 sp=0xc00006e630 pc=0x55fedf539ee7 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006e7e8 sp=0xc00006e7e0 pc=0x55fedf597541 created by runtime.createfing in goroutine 1 runtime/mfinal.go:166 +0x3d goroutine 21 gp=0xc000103500 m=nil [chan receive]: runtime.gopark(0xc000229900?, 0xc000590018?, 0x60?, 0xb7?, 0x55fedf675928?) runtime/proc.go:435 +0xce fp=0xc00006b718 sp=0xc00006b6f8 pc=0x55fedf58f6ae runtime.chanrecv(0xc000118310, 0x0, 0x1) runtime/chan.go:664 +0x445 fp=0xc00006b790 sp=0xc00006b718 pc=0x55fedf52b9c5 runtime.chanrecv1(0x0?, 0x0?) runtime/chan.go:506 +0x12 fp=0xc00006b7b8 sp=0xc00006b790 pc=0x55fedf52b552 runtime.unique_runtime_registerUniqueMapCleanup.func2(...) runtime/mgc.go:1796 runtime.unique_runtime_registerUniqueMapCleanup.gowrap1() runtime/mgc.go:1799 +0x2f fp=0xc00006b7e0 sp=0xc00006b7b8 pc=0x55fedf53e0cf runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006b7e8 sp=0xc00006b7e0 pc=0x55fedf597541 created by unique.runtime_registerUniqueMapCleanup in goroutine 1 runtime/mgc.go:1794 +0x85 goroutine 22 gp=0xc000103a40 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006bf38 sp=0xc00006bf18 pc=0x55fedf58f6ae runtime.gcBgMarkWorker(0xc000119730) runtime/mgc.go:1423 +0xe9 fp=0xc00006bfc8 sp=0xc00006bf38 pc=0x55fedf53d3e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00006bfe0 sp=0xc00006bfc8 pc=0x55fedf53d2c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006bfe8 sp=0xc00006bfe0 pc=0x55fedf597541 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 23 gp=0xc000103c00 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006c738 sp=0xc00006c718 pc=0x55fedf58f6ae runtime.gcBgMarkWorker(0xc000119730) runtime/mgc.go:1423 +0xe9 fp=0xc00006c7c8 sp=0xc00006c738 pc=0x55fedf53d3e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00006c7e0 sp=0xc00006c7c8 pc=0x55fedf53d2c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006c7e8 sp=0xc00006c7e0 pc=0x55fedf597541 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 34 gp=0xc000306000 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:435 +0xce fp=0xc00030c738 sp=0xc00030c718 pc=0x55fedf58f6ae runtime.gcBgMarkWorker(0xc000119730) runtime/mgc.go:1423 +0xe9 fp=0xc00030c7c8 sp=0xc00030c738 pc=0x55fedf53d3e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00030c7e0 sp=0xc00030c7c8 pc=0x55fedf53d2c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00030c7e8 sp=0xc00030c7e0 pc=0x55fedf597541 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 35 gp=0xc0003061c0 m=nil [GC worker (idle)]: runtime.gopark(0xd24871050930d?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00030cf38 sp=0xc00030cf18 pc=0x55fedf58f6ae runtime.gcBgMarkWorker(0xc000119730) runtime/mgc.go:1423 +0xe9 fp=0xc00030cfc8 sp=0xc00030cf38 pc=0x55fedf53d3e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00030cfe0 sp=0xc00030cfc8 pc=0x55fedf53d2c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00030cfe8 sp=0xc00030cfe0 pc=0x55fedf597541 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 3 gp=0xc0000036c0 m=nil [GC worker (idle)]: runtime.gopark(0xd248710515e7f?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006f738 sp=0xc00006f718 pc=0x55fedf58f6ae runtime.gcBgMarkWorker(0xc000119730) runtime/mgc.go:1423 +0xe9 fp=0xc00006f7c8 sp=0xc00006f738 pc=0x55fedf53d3e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00006f7e0 sp=0xc00006f7c8 pc=0x55fedf53d2c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006f7e8 sp=0xc00006f7e0 pc=0x55fedf597541 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 24 gp=0xc000103dc0 m=nil [GC worker (idle)]: runtime.gopark(0xd2487104e9bfa?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006cf38 sp=0xc00006cf18 pc=0x55fedf58f6ae runtime.gcBgMarkWorker(0xc000119730) runtime/mgc.go:1423 +0xe9 fp=0xc00006cfc8 sp=0xc00006cf38 pc=0x55fedf53d3e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00006cfe0 sp=0xc00006cfc8 pc=0x55fedf53d2c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006cfe8 sp=0xc00006cfe0 pc=0x55fedf597541 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 36 gp=0xc000306380 m=nil [GC worker (idle)]: runtime.gopark(0xd24871051b2ed?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00030d738 sp=0xc00030d718 pc=0x55fedf58f6ae runtime.gcBgMarkWorker(0xc000119730) runtime/mgc.go:1423 +0xe9 fp=0xc00030d7c8 sp=0xc00030d738 pc=0x55fedf53d3e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00030d7e0 sp=0xc00030d7c8 pc=0x55fedf53d2c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00030d7e8 sp=0xc00030d7e0 pc=0x55fedf597541 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 4 gp=0xc000003880 m=nil [GC worker (idle)]: runtime.gopark(0xd248710518979?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:435 +0xce fp=0xc00006ff38 sp=0xc00006ff18 pc=0x55fedf58f6ae runtime.gcBgMarkWorker(0xc000119730) runtime/mgc.go:1423 +0xe9 fp=0xc00006ffc8 sp=0xc00006ff38 pc=0x55fedf53d3e9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x55fedf53d2c5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x55fedf597541 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 6 gp=0xc000306c40 m=nil [chan receive]: runtime.gopark(0x55fedf595554?, 0xc0000458d0?, 0x70?, 0xf3?, 0xc0000458b8?) runtime/proc.go:435 +0xce fp=0xc000045898 sp=0xc000045878 pc=0x55fedf58f6ae runtime.chanrecv(0xc000426070, 0xc000045a70, 0x1) runtime/chan.go:664 +0x445 fp=0xc000045910 sp=0xc000045898 pc=0x55fedf52b9c5 runtime.chanrecv1(0xc0000d6f60?, 0xc000115c00?) runtime/chan.go:506 +0x12 fp=0xc000045938 sp=0xc000045910 pc=0x55fedf52b552 github.com/ollama/ollama/runner/llamarunner.(*Server).embeddings(0xc0000d1900, {0x55fee0b06820, 0xc00016c7e0}, 0xc00001d2c0) github.com/ollama/ollama/runner/llamarunner/runner.go:806 +0x72d fp=0xc000045ac0 sp=0xc000045938 pc=0x55fedfa0268d github.com/ollama/ollama/runner/llamarunner.(*Server).embeddings-fm({0x55fee0b06820?, 0xc00016c7e0?}, 0xc000045b40?) <autogenerated>:1 +0x36 fp=0xc000045af0 sp=0xc000045ac0 pc=0x55fedfa04ed6 net/http.HandlerFunc.ServeHTTP(0xc00052d5c0?, {0x55fee0b06820?, 0xc00016c7e0?}, 0xc000045b60?) net/http/server.go:2294 +0x29 fp=0xc000045b18 sp=0xc000045af0 pc=0x55fedf88ec29 net/http.(*ServeMux).ServeHTTP(0x55fedf534405?, {0x55fee0b06820, 0xc00016c7e0}, 0xc00001d2c0) net/http/server.go:2822 +0x1c4 fp=0xc000045b68 sp=0xc000045b18 pc=0x55fedf890b24 net/http.serverHandler.ServeHTTP({0x55fee0b02e10?}, {0x55fee0b06820?, 0xc00016c7e0?}, 0x1?) net/http/server.go:3301 +0x8e fp=0xc000045b98 sp=0xc000045b68 pc=0x55fedf8ae5ae net/http.(*conn).serve(0xc0000d83f0, {0x55fee0b08c28, 0xc0000d6960}) net/http/server.go:2102 +0x625 fp=0xc000045fb8 sp=0xc000045b98 pc=0x55fedf88d125 net/http.(*Server).Serve.gowrap3() net/http/server.go:3454 +0x28 fp=0xc000045fe0 sp=0xc000045fb8 pc=0x55fedf8929e8 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000045fe8 sp=0xc000045fe0 pc=0x55fedf597541 created by net/http.(*Server).Serve in goroutine 1 net/http/server.go:3454 +0x485 goroutine 47 gp=0xc000306e00 m=nil [IO wait]: runtime.gopark(0x55fee1368940?, 0xc00030fe38?, 0x38?, 0xfe?, 0xb?) runtime/proc.go:435 +0xce fp=0xc00030fdd8 sp=0xc00030fdb8 pc=0x55fedf58f6ae runtime.netpollblock(0x55fedf5b2e78?, 0xdf528de6?, 0xfe?) runtime/netpoll.go:575 +0xf7 fp=0xc00030fe10 sp=0xc00030fdd8 pc=0x55fedf5549d7 internal/poll.runtime_pollWait(0x7f97ee406cc8, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc00030fe30 sp=0xc00030fe10 pc=0x55fedf58e8c5 internal/poll.(*pollDesc).wait(0xc0000b3d00?, 0xc0000d6a61?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00030fe58 sp=0xc00030fe30 pc=0x55fedf616a47 internal/poll.(*pollDesc).waitRead(...) 
internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc0000b3d00, {0xc0000d6a61, 0x1, 0x1}) internal/poll/fd_unix.go:165 +0x27a fp=0xc00030fef0 sp=0xc00030fe58 pc=0x55fedf617d3a net.(*netFD).Read(0xc0000b3d00, {0xc0000d6a61?, 0xc00004d198?, 0xc00030ff70?}) net/fd_posix.go:55 +0x25 fp=0xc00030ff38 sp=0xc00030fef0 pc=0x55fedf68cd45 net.(*conn).Read(0xc000124968, {0xc0000d6a61?, 0x0?, 0x0?}) net/net.go:194 +0x45 fp=0xc00030ff80 sp=0xc00030ff38 pc=0x55fedf69b105 net/http.(*connReader).backgroundRead(0xc0000d6a50) net/http/server.go:690 +0x37 fp=0xc00030ffc8 sp=0xc00030ff80 pc=0x55fedf886ff7 net/http.(*connReader).startBackgroundRead.gowrap2() net/http/server.go:686 +0x25 fp=0xc00030ffe0 sp=0xc00030ffc8 pc=0x55fedf886f25 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00030ffe8 sp=0xc00030ffe0 pc=0x55fedf597541 created by net/http.(*connReader).startBackgroundRead in goroutine 6 net/http/server.go:686 +0xb6 rax 0x0 rbx 0xad rcx 0x7f9835fb4b2c rdx 0x6 rdi 0xad rsi 0xad rbp 0x7ffe92e34c90 rsp 0x7ffe92e34c50 r8 0x0 r9 0x7 r10 0x8 r11 0x246 r12 0x6 r13 0x7f97e6f63260 r14 0x16 r15 0x7f961a3c6040 rip 0x7f9835fb4b2c rflags 0x246 cs 0x33 fs 0x0 gs 0x0 [GIN] 2025/12/18 - 09:57:31 | 400 | 3.428637064s | 127.0.0.1 | POST "/api/embed" time=2025-12-18T09:57:31.081Z level=ERROR source=server.go:265 msg="llama runner terminated" error="exit status 2" time=2025-12-18T09:57:31.081Z level=DEBUG source=sched.go:537 msg="context for request finished" time=2025-12-18T09:57:31.081Z level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=173 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096 duration=5m0s time=2025-12-18T09:57:31.081Z level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/manutic/nomic-embed-code:7b-Q4_K_M runner.size="4.3 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=173 runner.model=/root/.ollama/models/blobs/sha256-08b01be72bc55f65c3f7f3ad703eecd410457b97ab86ce3c3794a98129c999c7 runner.num_ctx=4096 refCount=0
```

Same error with a reduced num_ctx:

```bash
curl http://localhost:11434/api/embed -d '{
  "model": "manutic/nomic-embed-code:7b-Q4_K_M",
  "input": ["Represent this query for searching relevant code: def factorial(n): return 1 if n == 0 else n * factorial(n-1)"],
  "options": {"num_ctx": 256}
}'
{"error":"do embedding request: Post \"http://127.0.0.1:41215/embedding\": EOF"}
```
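A quick way to narrow this down, sketched below under a few assumptions (none of these commands are from the original report; `nomic-embed-text` is only used as a known-good baseline from the Ollama library, and the container name is a placeholder), is to compare the failing model against a baseline embedding model on the same `/api/embed` endpoint and then capture the runner log around the crash:

```bash
# 1) Baseline: confirm /api/embed works at all on this host with a known-good embedding model.
ollama pull nomic-embed-text
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "hello world"
}'

# 2) Retry the failing model with the shortest possible input, to see whether the
#    crash depends on the prompt at all.
curl http://localhost:11434/api/embed -d '{
  "model": "manutic/nomic-embed-code:7b-Q4_K_M",
  "input": "hello"
}'

# 3) Capture the runner log around the crash (container installs vs. systemd installs).
docker logs --tail 200 <ollama-container>   # placeholder container name
journalctl -u ollama -n 200                  # if running as a systemd service instead
```

If the baseline model embeds fine while nomic-embed-code still aborts with `GGML_ASSERT(i01 >= 0 && i01 < ne01)`, that would point at this particular GGUF rather than the host setup.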
Reference: github-starred/ollama#29571