[GH-ISSUE #8503] Cannot overcome Ollama error: ollama._types.ResponseError: POST predict: Post "http://127.0.0.1:35843/completion": EOF / panic: failed to decode batch: could not find a kv cache slot #5481

Closed
opened 2026-04-12 16:42:35 -05:00 by GiteaMirror · 9 comments

Originally created by @user-33948 on GitHub (Jan 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8503

What is the issue?

I believe there is a bug in Ollama's processing that produces the following two errors:

(from running the Python file:) Ollama error: ollama._types.ResponseError: POST predict: Post "http://127.0.0.1:35843/completion": EOF

(from the Ollama logs:) panic: failed to decode batch: could not find a kv cache slot

To recreate: I am using Ollama as my LLM to create a property graph in LlamaIndex, following the code in this documentation: https://docs.llamaindex.ai/en/stable/examples/property_graph/property_graph_advanced/

Shortly after the code begins to run, it fails with the errors above. I confirmed with LlamaIndex that the problem is with Ollama, not LlamaIndex (https://github.com/run-llama/llama_index/issues/17424). The LlamaIndex team linked my issue to https://github.com/ollama/ollama/issues/7949.

I have OLLAMA_DEBUG set, and the error appears in both the Python script output and the Ollama logs. Requesting your support in fixing this issue. I am currently running Ollama v0.5.1, but have tried 0.3.14 per other comments on GitHub and get the same error.

Here is the error from the python script:

Traceback (most recent call last):
File "/home/LlamaIndexTutorials/main.py", line 150, in
index = PropertyGraphIndex.from_documents(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 119, in from_documents
return cls(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/property_graph/base.py", line 134, in init
super().init(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 77, in init
index_struct = self.build_index_from_nodes(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 185, in build_index_from_nodes
return self._build_index_from_nodes(nodes, **build_kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/property_graph/base.py", line 334, in _build_index_from_nodes
nodes = self._insert_nodes(nodes or [])
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/property_graph/base.py", line 200, in _insert_nodes
nodes = asyncio.run(
File "/home/.local/lib/python3.10/site-packages/nest_asyncio.py", line 30, in run
return loop.run_until_complete(task)
File "/home/.local/lib/python3.10/site-packages/nest_asyncio.py", line 98, in run_until_complete
return f.result()
File "/usr/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/usr/lib/python3.10/asyncio/tasks.py", line 232, in __step
result = coro.send(None)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/ingestion/pipeline.py", line 137, in arun_transformations
nodes = await transform.acall(nodes, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/property_graph/transformations/schema_llm.py", line 380, in acall
return await run_jobs(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py", line 367, in async_wrapper
result = await func(*args, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/async_utils.py", line 146, in run_jobs
results = await tqdm_asyncio.gather(*pool_jobs, desc=desc)
File "/home/.local/lib/python3.10/site-packages/tqdm/asyncio.py", line 79, in gather
res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
File "/home/.local/lib/python3.10/site-packages/tqdm/asyncio.py", line 79, in
res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
File "/usr/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
return f.result() # May raise f.exception().
File "/usr/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/usr/lib/python3.10/asyncio/tasks.py", line 232, in __step
result = coro.send(None)
File "/home/.local/lib/python3.10/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
return i, await f
File "/home/.local/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py", line 367, in async_wrapper
result = await func(*args, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/async_utils.py", line 139, in worker
return await job
File "/home/.local/lib/python3.10/site-packages/llama_index/core/indices/property_graph/transformations/schema_llm.py", line 344, in _aextract
kg_schema = await self.llm.astructured_predict(
File "/home/.local/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py", line 367, in async_wrapper
result = await func(*args, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/llms/ollama/base.py", line 521, in astructured_predict
response = await self.achat(messages, **llm_kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py", line 367, in async_wrapper
result = await func(*args, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/core/llms/callbacks.py", line 75, in wrapped_async_llm_chat
f_return_val = await f(_self, messages, **kwargs)
File "/home/.local/lib/python3.10/site-packages/llama_index/llms/ollama/base.py", line 435, in achat
response = await self.async_client.chat(
File "/home/.local/lib/python3.10/site-packages/ollama/_client.py", line 834, in chat
return await self._request(
File "/home/.local/lib/python3.10/site-packages/ollama/_client.py", line 679, in _request
return cls(**(await self._request_raw(*args, **kwargs)).json())
File "/home/.local/lib/python3.10/site-packages/ollama/_client.py", line 624, in _request_raw
raise ResponseError(e.response.text, e.response.status_code) from None
ollama._types.ResponseError: POST predict: Post "http://127.0.0.1:35843/completion": EOF

Here is the Ollama log:

Jan 20 09:55:34 ollama[174]: time=2025-01-20T09:55:34.271-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:35843"
Jan 20 09:55:34 ollama[174]: llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 (version GGUF V3 (latest))
Jan 20 09:55:34 ollama[174]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 0: general.architecture str = llama
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 1: general.type str = model
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 3: general.finetune str = Instruct
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 5: general.size_label str = 8B
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 6: general.license str = llama3.1
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 9: llama.block_count u32 = 32
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 17: general.file_type u32 = 2
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
Jan 20 09:55:34 ollama[174]: time=2025-01-20T09:55:34.325-05:00 level=INFO source=server.go:610 msg="waiting for server to become available" status="llm server loading model"
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
Jan 20 09:55:34 ollama[174]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
Jan 20 09:55:34 ollama[174]: llama_model_loader: - type f32: 65 tensors
Jan 20 09:55:34 ollama[174]: llama_model_loader: - type q4_0: 225 tensors
Jan 20 09:55:34 ollama[174]: llama_model_loader: - type q6_K: 1 tensors
Jan 20 09:55:34 ollama[174]: llm_load_vocab: special tokens cache size = 256
Jan 20 09:55:34 ollama[174]: llm_load_vocab: token to piece cache size = 0.7999 MB
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: format = GGUF V3 (latest)
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: arch = llama
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: vocab type = BPE
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_vocab = 128256
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_merges = 280147
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: vocab_only = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_ctx_train = 131072
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_embd = 4096
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_layer = 32
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_head = 32
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_head_kv = 8
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_rot = 128
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_swa = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_embd_head_k = 128
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_embd_head_v = 128
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_gqa = 4
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_embd_k_gqa = 1024
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_embd_v_gqa = 1024
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: f_norm_eps = 0.0e+00
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: f_logit_scale = 0.0e+00
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_ff = 14336
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_expert = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_expert_used = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: causal attn = 1
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: pooling type = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: rope type = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: rope scaling = linear
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: freq_base_train = 500000.0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: freq_scale_train = 1
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: rope_finetuned = unknown
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: ssm_d_conv = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: ssm_d_inner = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: ssm_d_state = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: ssm_dt_rank = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: model type = 8B
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: model ftype = Q4_0
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: model params = 8.03 B
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: LF token = 128 'Ä'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
Jan 20 09:55:34 ollama[174]: llm_load_print_meta: max token length = 256
Jan 20 09:55:34 ollama[174]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Jan 20 09:55:34 ollama[174]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Jan 20 09:55:34 ollama[174]: ggml_cuda_init: found 1 CUDA devices:
Jan 20 09:55:34 ollama[174]: Device 0: NVIDIA GeForce RTX 4090 Laptop GPU, compute capability 8.9, VMM: yes
Jan 20 09:55:35 ollama[174]: llm_load_tensors: ggml ctx size = 0.27 MiB
Jan 20 09:55:39 ollama[174]: llm_load_tensors: offloading 32 repeating layers to GPU
Jan 20 09:55:39 ollama[174]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 20 09:55:39 ollama[174]: llm_load_tensors: offloaded 33/33 layers to GPU
Jan 20 09:55:39 ollama[174]: llm_load_tensors: CPU buffer size = 281.81 MiB
Jan 20 09:55:39 ollama[174]: llm_load_tensors: CUDA0 buffer size = 4155.99 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: n_ctx = 15616
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: n_batch = 2048
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: n_ubatch = 512
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: flash_attn = 0
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: freq_base = 500000.0
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: freq_scale = 1
Jan 20 09:55:40 ollama[174]: llama_kv_cache_init: CUDA0 KV buffer size = 1952.00 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: KV self size = 1952.00 MiB, K (f16): 976.00 MiB, V (f16): 976.00 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: CUDA0 compute buffer size = 1038.50 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: CUDA_Host compute buffer size = 38.51 MiB
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: graph nodes = 1030
Jan 20 09:55:40 ollama[174]: llama_new_context_with_model: graph splits = 2
Jan 20 09:55:40 ollama[174]: time=2025-01-20T09:55:40.352-05:00 level=INFO source=server.go:615 msg="llama runner started in 6.28 seconds"
Jan 20 09:56:01 ollama[174]: [GIN] 2025/01/20 - 09:56:01 | 200 | 27.323051198s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:56:16 ollama[174]: [GIN] 2025/01/20 - 09:56:16 | 200 | 42.63900776s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:56:20 ollama[174]: [GIN] 2025/01/20 - 09:56:20 | 200 | 47.105286485s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:56:23 ollama[174]: [GIN] 2025/01/20 - 09:56:23 | 200 | 49.422469993s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:02 ollama[174]: [GIN] 2025/01/20 - 09:57:02 | 200 | 1m1s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:07 ollama[174]: [GIN] 2025/01/20 - 09:57:07 | 200 | 44.852374838s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:10 ollama[174]: [GIN] 2025/01/20 - 09:57:10 | 200 | 54.031440642s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:17 ollama[174]: [GIN] 2025/01/20 - 09:57:17 | 200 | 7.628685854s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:54 ollama[174]: [GIN] 2025/01/20 - 09:57:54 | 200 | 51.377588763s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:57:58 ollama[174]: [GIN] 2025/01/20 - 09:57:58 | 200 | 50.995044466s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:58:05 ollama[174]: [GIN] 2025/01/20 - 09:58:05 | 200 | 47.414462265s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:58:45 ollama[174]: [GIN] 2025/01/20 - 09:58:45 | 200 | 51.214899981s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:59:12 ollama[174]: [GIN] 2025/01/20 - 09:59:12 | 200 | 1m13s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:59:18 ollama[174]: [GIN] 2025/01/20 - 09:59:18 | 200 | 1m13s | 127.0.0.1 | POST "/api/chat"
Jan 20 09:59:37 ollama[174]: [GIN] 2025/01/20 - 09:59:37 | 200 | 51.817729012s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:00:11 ollama[174]: [GIN] 2025/01/20 - 10:00:11 | 200 | 59.507936464s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:00:22 ollama[174]: [GIN] 2025/01/20 - 10:00:22 | 200 | 1m3s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:00:52 ollama[174]: [GIN] 2025/01/20 - 10:00:52 | 200 | 1m15s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:05 ollama[174]: [GIN] 2025/01/20 - 10:01:05 | 200 | 42.657578432s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:10 ollama[174]: [GIN] 2025/01/20 - 10:01:10 | 200 | 58.529013717s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:16 ollama[174]: [GIN] 2025/01/20 - 10:01:16 | 200 | 23.427177508s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:23 ollama[174]: [GIN] 2025/01/20 - 10:01:23 | 200 | 17.708491186s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:32 ollama[174]: [GIN] 2025/01/20 - 10:01:32 | 200 | 22.494067591s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:37 ollama[174]: [GIN] 2025/01/20 - 10:01:37 | 200 | 21.82285361s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:41 ollama[174]: [GIN] 2025/01/20 - 10:01:41 | 200 | 18.875235561s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:50 ollama[174]: [GIN] 2025/01/20 - 10:01:50 | 200 | 17.608728526s | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:51 ollama[174]: panic: failed to decode batch: could not find a kv cache slot
Jan 20 10:01:51 ollama[174]: goroutine 7 [running]:
Jan 20 10:01:51 ollama[174]: main.(*Server).run(0xc0000ec120, {0x55e6900a79a0, 0xc0000c20a0})
Jan 20 10:01:51 ollama[174]: github.com/ollama/ollama/llama/runner/runner.go:344 +0x23e
Jan 20 10:01:51 ollama[174]: created by main.main in goroutine 1
Jan 20 10:01:51 ollama[174]: github.com/ollama/ollama/llama/runner/runner.go:980 +0xd3e
Jan 20 10:01:51 ollama[174]: [GIN] 2025/01/20 - 10:01:51 | 500 | 753.826293ms | 127.0.0.1 | POST "/api/chat"
Jan 20 10:01:51 ollama[174]: [GIN] 2025/01/20 - 10:01:51 | 500 | 5m30s | 127.0.0.1 | POST "/api/chat"

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.1

GiteaMirror added the bug label 2026-04-12 16:42:35 -05:00

@rick-github commented on GitHub (Jan 21, 2025):

Try increasing the context size (`num_ctx`).
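
A minimal sketch of one way to do that with the official `ollama` Python client, passing `num_ctx` through the per-request `options` (the value here is only an example; pick something that covers your prompts):

```python
# Minimal sketch: pass num_ctx through the request options so the runner
# allocates a larger context when it loads the model. Example value only.
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "hello"}],
    options={"num_ctx": 32768},  # context window, in tokens
)
print(response["message"]["content"])
```

The same parameter can also be baked into a model via a Modelfile line such as `PARAMETER num_ctx 32768`, so every client gets the larger window.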


@user-33948 commented on GitHub (Jan 24, 2025):

@rick-github Will do. What would you recommend setting it to?

I have tried this before and it didn't work, but I can try again with whatever you recommend.


@rick-github commented on GitHub (Jan 24, 2025):

It depends on what data you are sending. From the logs, you have it set to 15616. You have a GeForce RTX 4090, yet ollama is sometimes spending over a minute processing data, which implies quite a large prompt. If nothing else is using your GPU you'd have about 15G free, so you could bump `num_ctx` to 80000 without too much trouble. If you enable verbose debugging with `OLLAMA_DEBUG=1`, there may be information about the processing that the ollama runner is doing that sheds light on why it can't find a kv cache slot.


@viba1 commented on GitHub (Jan 26, 2025):

Hi all,

I also encounter a "Error: POST predict: Post "http://127.0.0.1:46415/completion": EOF", using phi4, wich is a 14b model, but don't know if linked to previous.

I changed num_ctx to 8192, and obtain the same problem.

ollama.log

See complete DEBUG logs in attachment.

I don't encounters the problem with other 14b models, deepseek-r1:14b or gemma2:27b which are both using CPU and GPU.

AMD GPU
GTX 980Ti (6Gb VRAM)
32Gb DDRAM
DEBIAN 12
NVIDIA DRIVER is 535.216.01
CUDA DRIVER is 12.2


@rick-github commented on GitHub (Jan 26, 2025):

```
janv. 26 13:02:43  ollama[1393]: time=2025-01-26T13:02:43.755+01:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=20 layers.split="" memory.available="[5.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="10.4 GiB" memory.required.partial="5.5 GiB" memory.required.kv="800.0 MiB" memory.required.allocations="[5.5 GiB]" memory.weights.total="8.5 GiB" memory.weights.repeating="8.2 GiB" memory.weights.nonrepeating="402.0 MiB" memory.graph.full="533.3 MiB" memory.graph.partial="533.3 MiB"
```

This is a different problem. You have 5.5GiB free and the model needs 10.4GiB, so ollama is doing a partial load. It's allocating 5.5GiB, i.e., taking all VRAM for the initial model load. This doesn't leave enough room for transient allocations and the runner OOMs. There are some mitigations you can take (a sketch of setting the first two in the server environment follows the list):

  1. Set [`OLLAMA_GPU_OVERHEAD`](https://github.com/ollama/ollama/blob/5f8051180e3b9aeafc153f6b5056e7358a939c88/envconfig/config.go#L237) to give llama.cpp a buffer to grow into (e.g., `OLLAMA_GPU_OVERHEAD=536870912` to reserve 512M).
  2. Enable flash attention by setting [`OLLAMA_FLASH_ATTENTION=1`](https://github.com/ollama/ollama/blob/5f8051180e3b9aeafc153f6b5056e7358a939c88/envconfig/config.go#L236) in the server environment. Flash attention is a more efficient use of memory and may reduce memory pressure.
  3. Reduce the number of layers that ollama thinks it can offload to the GPU, see [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650). Ollama is currently offloading 20 layers; try setting `num_gpu` to 15.
  4. Set `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`. This will allow the GPU to offload to CPU memory if VRAM is exhausted. This is only useful for small amounts of memory as there is a [performance penalty](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900). However, in the case where the goal is to reduce OOMs, the amount offloaded will be small and the impact minimal.
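
A minimal sketch of mitigations 1 and 2 for a hand-launched server, assuming `ollama` is on the PATH; a systemd-managed install needs the same variables set in the service unit instead. Values are illustrative, not tuned:

```python
# Sketch: start `ollama serve` with VRAM headroom reserved and flash
# attention enabled. Only affects a server launched this way, not an
# already-running systemd service.
import os
import subprocess

env = dict(
    os.environ,
    OLLAMA_GPU_OVERHEAD=str(512 * 1024 * 1024),  # reserve 512 MiB of VRAM
    OLLAMA_FLASH_ATTENTION="1",                  # more memory-efficient attention
)
subprocess.Popen(["ollama", "serve"], env=env)
```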

@liuliwei91 commented on GitHub (Jan 27, 2025):

I'm getting the same error: ollama._types.ResponseError: POST predict: Post "http://127.0.0.1:40457/completion": EOF.

Don't use "systemctl" to start ollama; use "ollama serve" instead. The servers started by these two methods are isolated from each other, so models downloaded under one are not visible to the other.


@rick-github commented on GitHub (Jan 27, 2025):

Your error is not the same as the original post. Create a new issue and add [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues).


@user-33948 commented on GitHub (Feb 1, 2025):

> It depends on what data you are sending. From the logs, you have it set to 15616. You have a GeForce RTX 4090, yet ollama is sometimes spending over a minute processing data, which implies quite a large prompt. If nothing else is using your GPU you'd have about 15G free, so you could bump `num_ctx` to 80000 without too much trouble. If you enable verbose debugging with `OLLAMA_DEBUG=1`, there may be information about the processing that the ollama runner is doing that sheds light on why it can't find a kv cache slot.

I tried changing num_ctx, but looking at the Ollama logs it's clear the model didn't actually pick up the updated context window. See the code below:

```python
import os

from llama_index.core.indices.property_graph import SchemaLLMPathExtractor
from llama_index.llms.ollama import Ollama

os.environ['OLLAMA_DEBUG'] = '1'

kg_extractor = SchemaLLMPathExtractor(
    llm=Ollama(model="llama3.1", num_ctx=80000, json_mode=True, request_timeout=3600),
    possible_entities=entities,
    possible_relations=relations,
    kg_validation_schema=validation_schema,
    strict=True,
)
```

After running this and checking the logs, I still get:

ollama[186]: llama_new_context_with_model: n_ctx = 15616

Requesting help updating the context window. Thanks.
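
A hedged sketch of an alternative: it is an assumption here that the LlamaIndex `Ollama` wrapper maps `context_window` to Ollama's `num_ctx` option and merges `additional_kwargs` into the request options; if a bare `num_ctx=` keyword is being silently dropped, one of these routes may reach the runner instead:

```python
# Sketch under the assumption that the wrapper forwards context_window
# (as num_ctx) and additional_kwargs into the Ollama request options.
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="llama3.1",
    request_timeout=3600,
    json_mode=True,
    context_window=80000,                  # assumed to map to num_ctx
    additional_kwargs={"num_ctx": 80000},  # assumed to merge into options
)
```

If either route takes effect, the next model load in the server log should show `n_ctx` near the requested value rather than 15616.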


@rick-github commented on GitHub (Feb 6, 2025):

You have to set `OLLAMA_DEBUG=1` in the server environment.

> llm= Ollama(model="llama3.1", num_ctx=80000, json_mode=True, request_timeout=3600),

You don't give enough information to determine how your client is using the framework. Where is `Ollama` defined?
