[GH-ISSUE #8503] Cannot overcome Ollama error: ollama._types.ResponseError: POST predict: Post "http://127.0.0.1:35843/completion": EOF / panic: failed to decode batch: could not find a kv cache slot #5481

Closed
opened 2026-04-12 16:42:35 -05:00 by GiteaMirror · 9 comments

Originally created by @user-33948 on GitHub (Jan 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8503

What is the issue?

I believe there is a bug in Ollama's processing that produces the following two errors:

(from running the Python file:) Ollama error: ollama._types.ResponseError: POST predict: Post "http://127.0.0.1:35843/completion": EOF

(from the Ollama logs:) panic: failed to decode batch: could not find a kv cache slot

To recreate: I am using Ollama as my LLM to create a property graph in LlamaIndex, following the code in this documentation: https://docs.llamaindex.ai/en/stable/examples/property_graph/property_graph_advanced/

Shortly after the code begins to run, it fails with the errors above. I confirmed with LlamaIndex that the problem is with Ollama, not LlamaIndex (https://github.com/run-llama/llama_index/issues/17424). The LlamaIndex team linked my issue to https://github.com/ollama/ollama/issues/7949.

I have OLLAMA_DEBUG set, and the error appears in both the Python script output and the Ollama logs. Requesting your support in fixing this issue. I am currently running Ollama v0.5.1, but have tried 0.3.14 per other comments on GitHub and get the same error.

Here is the error from the python script:

Traceback (most recent call last):
File "/home/LlamaIndexTutorials/main.py", line 150, in
index = PropertyGraphIndex.from_documents(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 119, in from_documents
return cls(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/property_graph/base.py", line 134, in init
super().init(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 77, in init
index_struct = self.build_index_from_nodes(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 185, in build_index_from_nodes
return self._build_index_from_nodes(nodes, **build_kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/property_graph/base.py", line 334, in _build_index_from_nodes
nodes = self._insert_nodes(nodes or [])
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/property_graph/base.py", line 200, in _insert_nodes
nodes = asyncio.run(
File "/home/.local/lib/python3.10/site-packages/nest_asyncio.py", line 30, in run
return loop.run_until_complete(task)
File "/home/.local/lib/python3.10/site-packages/nest_asyncio.py", line 98, in run_until_complete
return f.result()
File "/usr/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/usr/lib/python3.10/asyncio/tasks.py", line 232, in __step
result = coro.send(None)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/ingestion/pipeline.py", line 137, in arun_transformations
nodes = await transform.acall(nodes, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/property_graph/transformations/schema_llm.py", line 380, in acall
return await run_jobs(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py", line 367, in async_wrapper
result = await func(*args, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/async_utils.py", line 146, in run_jobs
results = await tqdm_asyncio.gather(*pool_jobs, desc=desc)
File "/home/.local/lib/python3.10/site-packages/tqdm/asyncio.py", line 79, in gather
res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
File "/home/.local/lib/python3.10/site-packages/tqdm/asyncio.py", line 79, in
res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
File "/usr/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
return f.result() # May raise f.exception().
File "/usr/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/usr/lib/python3.10/asyncio/tasks.py", line 232, in __step
result = coro.send(None)
File "/home/.local/lib/python3.10/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
return i, await f
File "/home/.local/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py", line 367, in async_wrapper
result = await func(*args, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/async_utils.py", line 139, in worker
return await job
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/property_graph/transformations/schema_llm.py", line 344, in _aextract
kg_schema = await self.llm.astructured_predict(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py", line 367, in async_wrapper
result = await func(*args, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/llms/ollama/base.py", line 521, in astructured_predict
response = await self.achat(messages, **llm_kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py", line 367, in async_wrapper
result = await func(*args, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/llms/callbacks.py", line 75, in wrapped_async_llm_chat
f_return_val = await f(_self, messages, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/llms/ollama/base.py", line 435, in achat
response = await self.async_client.chat(
File "/home/.local/lib/python3.10/site-packages/ollama/_client.py", line 834, in chat
return await self._request(
File "/home/.local/lib/python3.10/site-packages/ollama/_client.py", line 679, in _request
return cls(**(await self._request_raw(*args, **kwargs)).json())
File "/home/.local/lib/python3.10/site-packages/ollama/_client.py", line 624, in _request_raw
raise ResponseError(e.response.text, e.response.status_code) from None
ollama._types.ResponseError: POST predict: Post "http://127.0.0.1:35843/completion": EOF

Here is the Ollama log:

Jan 20 09:55:34 ollama[174]: time=2025-01-20T09:55:34.271-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:35843"
Jan 20 09:55:34 ollama[174]: llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 (version GGUF V3 (latest))
Jan 20 09:55:34 ollama[174]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 0: general.architecture str = llama
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 1: general.type str = model
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 3: general.finetune str = Instruct
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 5: general.size_label str = 8B
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 6: general.license str = llama3.1
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 9: llama.block_count u32 = 32
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 17: general.file_type u32 = 2
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
Jan 20 09:55:34 ollama[174]: time=2025-01-20T09:55:34.325-05:00 level=INFO source=server.go:610 msg="waiting for server to become available" status="llm server loading model"
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
Jan 20 09:55:34 ollama[174]: llama_model_loader: - type f32: 65 tensors
Jan 20 09:55:34 ollama[174]: llama_model_loader: - type q4_0: 225 tensors
Jan 20 09:55:34 ollama[174]: llama_model_loader: - type q6_K: 1 tensors
Jan 20 09:55:34 ollama[174]: llm_load_vocab: special tokens cache size = 256
Jan 20 09:55:34 ollama[174]: llm_load_vocab: token to piece cache size = 0.7999 MB
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: format = GGUF V3 (latest)
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: arch = llama
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: vocab type = BPE
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_vocab = 128256
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_merges = 280147
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: vocab_only = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_ctx_train = 131072
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_embd = 4096
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_layer = 32
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_head = 32
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_head_kv = 8
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_rot = 128
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_swa = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_embd_head_k = 128
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_embd_head_v = 128
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_gqa = 4
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_embd_k_gqa = 1024
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_embd_v_gqa = 1024
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: f_norm_eps = 0.0e+00
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: f_logit_scale = 0.0e+00
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_ff = 14336
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_expert = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_expert_used = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: causal attn = 1
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: pooling type = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: rope type = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: rope scaling = linear
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: freq_base_train = 500000.0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: freq_scale_train = 1
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: rope_finetuned = unknown
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: ssm_d_conv = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: ssm_d_inner = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: ssm_d_state = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: ssm_dt_rank = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: model type = 8B
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: model ftype = Q4_0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: model params = 8.03 B
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: LF token = 128 'Ä'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: max token length = 256
Jan 20 09:55:34 ollama[174]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Jan 20 09:55:34 ollama[174]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Jan 20 09:55:34 ollama[174]: ggml_cuda_init: found 1 CUDA devices:
Jan 20 09:55:34 ollama[174]: Device 0: NVIDIA GeForce RTX 4090 Laptop GPU, compute capability 8.9, VMM: yes
Jan 20 09:55:35 ollama[174]: llm_load_tensors: ggml ctx size = 0.27 MiB
Jan 20 09:55:39 ollama[174]: llm_load_tensors: offloading 32 repeating layers to GPU
Jan 20 09:55:39 ollama[174]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 20 09:55:39 ollama[174]: llm_load_tensors: offloaded 33/33 layers to GPU
Jan 20 09:55:39 ollama[174]: llm_load_tensors: CPU buffer size = 281.81 MiB
Jan 20 09:55:39 ollama[174]: llm_load_tensors: CUDA0 buffer size = 4155.99 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: n_ctx = 15616
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: n_batch = 2048
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: n_ubatch = 512
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: flash_attn = 0
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: freq_base = 500000.0
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: freq_scale = 1
Jan 20 09:55:40 ollama[174]: llama_kv_cache_init: CUDA0 KV buffer size = 1952.00 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: KV self size = 1952.00 MiB, K (f16): 976.00 MiB, V (f16): 976.00 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: CUDA0 compute buffer size = 1038.50 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: CUDA_Host compute buffer size = 38.51 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: graph nodes = 1030
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: graph splits = 2
Jan 20 09:55:40 ollama[174]: time=2025-01-20T09:55:40.352-05:00 level=INFO source=server.go:615 msg="llama runner started in 6.28 seconds"
Jan 20 09:56:01 ollama[174]: [GIN] 2025/01/20 - 09:56:01 | 200 | 27.323051198s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:56:16 ollama[174]: [GIN] 2025/01/20 - 09:56:16 | 200 | 42.63900776s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:56:20 ollama[174]: [GIN] 2025/01/20 - 09:56:20 | 200 | 47.105286485s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:56:23 ollama[174]: [GIN] 2025/01/20 - 09:56:23 | 200 | 49.422469993s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:02 ollama[174]: [GIN] 2025/01/20 - 09:57:02 | 200 | 1m1s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:07 ollama[174]: [GIN] 2025/01/20 - 09:57:07 | 200 | 44.852374838s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:10 ollama[174]: [GIN] 2025/01/20 - 09:57:10 | 200 | 54.031440642s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:17 ollama[174]: [GIN] 2025/01/20 - 09:57:17 | 200 | 7.628685854s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:54 ollama[174]: [GIN] 2025/01/20 - 09:57:54 | 200 | 51.377588763s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:58 ollama[174]: [GIN] 2025/01/20 - 09:57:58 | 200 | 50.995044466s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:58:05 ollama[174]: [GIN] 2025/01/20 - 09:58:05 | 200 | 47.414462265s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:58:45 ollama[174]: [GIN] 2025/01/20 - 09:58:45 | 200 | 51.214899981s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:59:12 ollama[174]: [GIN] 2025/01/20 - 09:59:12 | 200 | 1m13s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:59:18 ollama[174]: [GIN] 2025/01/20 - 09:59:18 | 200 | 1m13s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:59:37 ollama[174]: [GIN] 2025/01/20 - 09:59:37 | 200 | 51.817729012s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:00:11 ollama[174]: [GIN] 2025/01/20 - 10:00:11 | 200 | 59.507936464s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:00:22 ollama[174]: [GIN] 2025/01/20 - 10:00:22 | 200 | 1m3s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:00:52 ollama[174]: [GIN] 2025/01/20 - 10:00:52 | 200 | 1m15s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:05 ollama[174]: [GIN] 2025/01/20 - 10:01:05 | 200 | 42.657578432s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:10 ollama[174]: [GIN] 2025/01/20 - 10:01:10 | 200 | 58.529013717s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:16 ollama[174]: [GIN] 2025/01/20 - 10:01:16 | 200 | 23.427177508s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:23 ollama[174]: [GIN] 2025/01/20 - 10:01:23 | 200 | 17.708491186s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:32 ollama[174]: [GIN] 2025/01/20 - 10:01:32 | 200 | 22.494067591s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:37 ollama[174]: [GIN] 2025/01/20 - 10:01:37 | 200 | 21.82285361s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:41 ollama[174]: [GIN] 2025/01/20 - 10:01:41 | 200 | 18.875235561s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:50 ollama[174]: [GIN] 2025/01/20 - 10:01:50 | 200 | 17.608728526s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:51 ollama[174]: panic: failed to decode batch: could not find a kv cache slot
Jan 20 10:01:51 ollama[174]: goroutine 7 [running]:
Jan 20 10:01:51 ollama[174]: main.(*Server).run(0xc0000ec120, {0x55e6900a79a0, 0xc0000c20a0})
Jan 20 10:01:51 ollama[174]: github.com/ollama/ollama/llama/runner/runner.go:344 +0x23e
Jan 20 10:01:51 ollama[174]: created by main.main in goroutine 1
Jan 20 10:01:51 ollama[174]: github.com/ollama/ollama/llama/runner/runner.go:980 +0xd3e
Jan 20 10:01:51 ollama[174]: [GIN] 2025/01/20 - 10:01:51 | 500 | 753.826293ms | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:51 ollama[174]: [GIN] 2025/01/20 - 10:01:51 | 500 | 5m30s | 127.0.0.1 | POST "/api/chat"

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.1

GiteaMirror added the bug label 2026-04-12 16:42:35 -05:00

@rick-github commented on GitHub (Jan 21, 2025):

Try increasing the context size (`num_ctx`).
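
A minimal sketch of one way to do that with the official `ollama` Python client, passing `num_ctx` through the per-request `options` (the value here is only an example; pick something that covers your prompts):

```python
# Minimal sketch: pass num_ctx through the request options so the runner
# allocates a larger context when it loads the model. Example value only.
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "hello"}],
    options={"num_ctx": 32768},  # context window, in tokens
)
print(response["message"]["content"])
```

The same parameter can also be baked into a model via a Modelfile line such as `PARAMETER num_ctx 32768`, so every client gets the larger window.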


@user-33948 commented on GitHub (Jan 24, 2025):

@rick-github Will do. What would you recommend setting it to?

I have tried this before and it didn't work, but I can try again with whatever you recommend.


@rick-github commented on GitHub (Jan 24, 2025):

It depends on what data you are sending. From the logs, you have it set to 15616. You have a GeForce RTX 4090, yet ollama is sometimes spending over a minute processing data, which implies quite a large prompt. If nothing else is using your GPU you'd have about 15G free, so you could bump `num_ctx` to 80000 without too much trouble. If you enable verbose debugging with `OLLAMA_DEBUG=1`, there may be information about the processing that the ollama runner is doing that sheds light on why it can't find a kv cache slot.


@viba1 commented on GitHub (Jan 26, 2025):

Hi all,

I also encounter a "Error: POST predict: Post "http://127.0.0.1:46415/completion": EOF", using phi4, wich is a 14b model, but don't know if linked to previous.

I changed num_ctx to 8192, and obtain the same problem.

ollama.log

See complete DEBUG logs in attachment.

I don't encounters the problem with other 14b models, deepseek-r1:14b or gemma2:27b which are both using CPU and GPU.

AMD GPU
GTX 980Ti (6Gb VRAM)
32Gb DDRAM
DEBIAN 12
NVIDIA DRIVER is 535.216.01
CUDA DRIVER is 12.2


@rick-github commented on GitHub (Jan 26, 2025):

```
janv. 26 13:02:43  ollama[1393]: time=2025-01-26T13:02:43.755+01:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=20 layers.split="" memory.available="[5.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="10.4 GiB" memory.required.partial="5.5 GiB" memory.required.kv="800.0 MiB" memory.required.allocations="[5.5 GiB]" memory.weights.total="8.5 GiB" memory.weights.repeating="8.2 GiB" memory.weights.nonrepeating="402.0 MiB" memory.graph.full="533.3 MiB" memory.graph.partial="533.3 MiB"
```

This is a different problem. You have 5.5GiB free and the model needs 10.4GiB, so ollama is doing a partial load. It's allocating 5.5GiB, i.e., taking all VRAM for the initial model load. This doesn't leave enough room for transient allocations and the runner OOMs. There are some mitigations you can take (a sketch of setting the first two in the server environment follows the list):

  1. Set [`OLLAMA_GPU_OVERHEAD`](https://github.com/ollama/ollama/blob/5f8051180e3b9aeafc153f6b5056e7358a939c88/envconfig/config.go#L237) to give llama.cpp a buffer to grow into (e.g., `OLLAMA_GPU_OVERHEAD=536870912` to reserve 512M).
  2. Enable flash attention by setting [`OLLAMA_FLASH_ATTENTION=1`](https://github.com/ollama/ollama/blob/5f8051180e3b9aeafc153f6b5056e7358a939c88/envconfig/config.go#L236) in the server environment. Flash attention is a more efficient use of memory and may reduce memory pressure.
  3. Reduce the number of layers that ollama thinks it can offload to the GPU, see [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650). Ollama is currently offloading 20 layers; try setting `num_gpu` to 15.
  4. Set `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`. This will allow the GPU to offload to CPU memory if VRAM is exhausted. This is only useful for small amounts of memory as there is a [performance penalty](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900). However, in the case where the goal is to reduce OOMs, the amount offloaded will be small and the impact minimal.
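
A minimal sketch of mitigations 1 and 2 for a hand-launched server, assuming `ollama` is on the PATH; a systemd-managed install needs the same variables set in the service unit instead. Values are illustrative, not tuned:

```python
# Sketch: start `ollama serve` with VRAM headroom reserved and flash
# attention enabled. Only affects a server launched this way, not an
# already-running systemd service.
import os
import subprocess

env = dict(
    os.environ,
    OLLAMA_GPU_OVERHEAD=str(512 * 1024 * 1024),  # reserve 512 MiB of VRAM
    OLLAMA_FLASH_ATTENTION="1",                  # more memory-efficient attention
)
subprocess.Popen(["ollama", "serve"], env=env)
```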

@liuliwei91 commented on GitHub (Jan 27, 2025):

I'm getting the same error: ollama._types.ResponseError: POST predict: Post "http://127.0.0.1:40457/completion": EOF.

Don't use "systemctl" to start ollama; use "ollama serve" instead. The servers started by these two methods are isolated from each other, so models downloaded under one are not visible to the other.


@rick-github commented on GitHub (Jan 27, 2025):

Your error is not the same as the original post. Create a new issue and add [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues).


@user-33948 commented on GitHub (Feb 1, 2025):

> It depends on what data you are sending. From the logs, you have it set to 15616. You have a GeForce RTX 4090, yet ollama is sometimes spending over a minute processing data, which implies quite a large prompt. If nothing else is using your GPU you'd have about 15G free, so you could bump `num_ctx` to 80000 without too much trouble. If you enable verbose debugging with `OLLAMA_DEBUG=1`, there may be information about the processing that the ollama runner is doing that sheds light on why it can't find a kv cache slot.

I tried changing num_ctx, but looking at the Ollama logs it's clear the model didn't actually pick up the updated context window. See the code below:

```python
import os

from llama_index.core.indices.property_graph import SchemaLLMPathExtractor
from llama_index.llms.ollama import Ollama

os.environ['OLLAMA_DEBUG'] = '1'

kg_extractor = SchemaLLMPathExtractor(
    llm=Ollama(model="llama3.1", num_ctx=80000, json_mode=True, request_timeout=3600),
    possible_entities=entities,
    possible_relations=relations,
    kg_validation_schema=validation_schema,
    strict=True,
)
```

After running this and checking the logs, I still get:

ollama[186]: llama_new_context_with_model: n_ctx = 15616

Requesting help updating the context window. Thanks.
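
A hedged sketch of an alternative: it is an assumption here that the LlamaIndex `Ollama` wrapper maps `context_window` to Ollama's `num_ctx` option and merges `additional_kwargs` into the request options; if a bare `num_ctx=` keyword is being silently dropped, one of these routes may reach the runner instead:

```python
# Sketch under the assumption that the wrapper forwards context_window
# (as num_ctx) and additional_kwargs into the Ollama request options.
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="llama3.1",
    request_timeout=3600,
    json_mode=True,
    context_window=80000,                  # assumed to map to num_ctx
    additional_kwargs={"num_ctx": 80000},  # assumed to merge into options
)
```

If either route takes effect, the next model load in the server log should show `n_ctx` near the requested value rather than 15616.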


@rick-github commented on GitHub (Feb 6, 2025):

You have to set `OLLAMA_DEBUG=1` in the server environment.

> llm= Ollama(model="llama3.1", num_ctx=80000, json_mode=True, request_timeout=3600),

You don't give enough information to determine how your client is using the framework. Where is `Ollama` defined?
