[GH-ISSUE #3583] Small Context Size (n_ctx) leads to crashes and log-file explosion #48726

Closed
opened 2026-04-28 09:09:29 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @TheMasterFX on GitHub (Apr 10, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3583

What is the issue?

I came across this error by accident. I wanted to limit the maximum number of predicted tokens to 64 for code generation, but I mistakenly used "n_ctx" instead of "num_predict". The result was that after a couple of tries the Ollama server stopped responding; after a couple of minutes the VRAM was freed, and sometimes the Ollama log file (server.log) grew to more than 4 GB.
I could reproduce it with almost every model (1.8B-7B).
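
For reference, the intended call would presumably have looked like the following, capping generation at 64 tokens with "num_predict" (note that Ollama's actual context-size option is spelled "num_ctx", not "n_ctx"):

from ollama import Client

client = Client('http://localhost:11434')

# Correct option name: num_predict limits the number of generated tokens.
response = client.generate(
    model='mistral:latest',
    prompt='Write a poem about why is the sky blue?',
    options={"num_predict": 64},
)
print(response['response'])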

What did you expect to see?

Ollama should not crash, and the log file should not grow to anywhere near that size.

Steps to reproduce

Create a Python script with the following content:

from ollama import Client

client = Client('http://localhost:11434')

# Send the misconfigured request repeatedly; the server typically hangs after a few runs.
for request_number in range(10):
    response = client.generate(
        model='mistral:latest',
        prompt='Write a poem about why is the sky blue?',
        options={"n_ctx": 64},  # wrong option name; "num_predict" was intended
    )
    tokens_per_second = response['eval_count'] / (response['eval_duration'] / 1_000_000_000)
    print(f'{request_number}: {tokens_per_second} - {response["response"]}')

After a couple of runs you should see it hang.
The log file then looks like this:

.............................................................................................
llama_new_context_with_model: n_ctx      = 128
llama_new_context_with_model: n_batch    = 128
llama_new_context_with_model: n_ubatch   = 128
llama_new_context_with_model: freq_base  = 999999.4
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =     3.75 MiB
llama_new_context_with_model: KV self size  =    3.75 MiB, K (f16):    1.88 MiB, V (f16):    1.88 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =    25.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    25.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     1.56 MiB
llama_new_context_with_model: graph nodes  = 1175
llama_new_context_with_model: graph splits = 2
[1712606577] warming up the model with an empty run
{"function":"initialize","level":"INFO","line":444,"msg":"initializing slots","n_slots":1,"tid":"5396","timestamp":1712606577}
{"function":"initialize","level":"INFO","line":456,"msg":"new slot","n_ctx_slot":128,"slot_id":0,"tid":"5396","timestamp":1712606577}
time=2024-04-08T22:02:57.185+02:00 level=INFO source=dyn_ext_server.go:159 msg="Starting llama main loop"
time=2024-04-08T22:02:57.185+02:00 level=DEBUG source=routes.go:249 msg="generate handler" prompt="Why is the sky blue?"
time=2024-04-08T22:02:57.186+02:00 level=DEBUG source=routes.go:250 msg="generate handler" template="{{ .Prompt }}"
time=2024-04-08T22:02:57.186+02:00 level=DEBUG source=routes.go:251 msg="generate handler" system=""
time=2024-04-08T22:02:57.186+02:00 level=DEBUG source=routes.go:282 msg="generate handler" prompt="Why is the sky blue?"
[1712606577] llama server main loop starting
{"function":"update_slots","level":"INFO","line":1572,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"17792","timestamp":1712606577}
{"function":"launch_slot_with_data","level":"INFO","line":829,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606577}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1810,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":6,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606577}
{"function":"update_slots","level":"INFO","line":1834,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606577}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606578}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606578}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606579}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606579}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606580}

....MANY OF THEM....

{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606596}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606596}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606597}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606597}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606598}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606598}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606599}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":63,"n_keep":0,"n_left":127,"n_past":127,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606599}
{"function":"update_slots","level":"INFO","line":1605,"msg":"slot context shift","n_cache_tokens":128,"n_ctx":128,"n_discard":64,"n_keep":0,"n_left":128,"n_past":128,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"17792","timestamp":1712606600}
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
[1712606600] update_slots : failed to decode the batch, n_batch = 1, ret = 1
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
[1712606600] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
[1712606600] update_slots : failed to decode the batch, n_batch = 1, ret = 1

After that, the log file keeps filling with "update_slots : failed to find free space in the KV cache" until you kill the Ollama process. (The environment variable OLLAMA_DEBUG=1 was set.)
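
For context, the arithmetic behind the repeating "slot context shift" lines can be reconstructed from the logged fields alone (a sketch implied by the log values, not a quote of the llama.cpp source):

# Context-shift arithmetic implied by the log fields above (n_ctx=128, n_keep=0).
# Sketch only; reconstructed from the logged values, not from llama.cpp code.
n_past = 127                 # cache is effectively full
n_keep = 0
n_left = n_past - n_keep     # 127, matching "n_left":127 in the log
n_discard = n_left // 2      # 63, matching "n_discard":63 in the log

With only 128 tokens of context, roughly half the cache (63-64 tokens) is discarded on every shift, the slot fills up again almost immediately, and each shift (and, later, each KV-cache retry) emits more log lines, which is why server.log can grow past 4 GB.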

Are there any recent changes that introduced the issue?

No response

OS

Windows

Architecture

amd64

Platform

No response

Ollama version

0.1.31

GPU

Nvidia

GPU info

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.01                 Driver Version: 546.01       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080      WDDM  | 00000000:02:00.0  On |                  N/A |
|  0%   48C    P8              27W / 370W |   1720MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

CPU

Intel

Other software

No response

GiteaMirror added the bug label 2026-04-28 09:09:29 -05:00
Author
Owner

@jmorganca commented on GitHub (Apr 15, 2024):

Hi there, this should be fixed as of 0.1.31 - are you running this version? Thanks for the issue
