[GH-ISSUE #3912] Server hang after ~400 long context requests mixtral or llama3 ollama 0.1.32 #28184

Closed
opened 2026-04-22 06:03:10 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @kungfu-eric on GitHub (Apr 25, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3912

What is the issue?

The server hangs after roughly 400 long-context requests with mixtral; the same happens with llama3.

```
ollama --version
ollama version is 0.1.32
```

This is on an AMD CPU with 2x NVIDIA A6000s, running Ubuntu 18.04 in a Docker container. The client uses the Python `ollama` package. As a workaround, I restart the server manually and wrap requests in `asyncio.wait_for` on the client side.
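The client-side part of that workaround can be sketched as below. This is a minimal illustration, not the reporter's actual code: `slow_request` is a hypothetical stand-in for the real call (e.g. `await ollama.AsyncClient().chat(...)`), and the timeout value is arbitrary.

```python
import asyncio

# Hypothetical stand-in for the real request, e.g.
#   await ollama.AsyncClient().chat(model="mixtral", messages=msgs)
async def slow_request():
    await asyncio.sleep(10)  # simulates a hung server that never responds
    return "response"

async def request_with_timeout(timeout: float):
    try:
        # asyncio.wait_for cancels the awaited task once `timeout` expires
        return await asyncio.wait_for(slow_request(), timeout=timeout)
    except asyncio.TimeoutError:
        # On timeout the caller can restart the server and retry.
        return None

# With a 0.1 s timeout the simulated hung request is cancelled:
result = asyncio.run(request_with_timeout(timeout=0.1))
```

On timeout the caller gets `None` back instead of blocking forever, which is what makes the manual server restart practical.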

> Please give 0.1.32 a try and let us know if you're still seeing unrecoverable hangs.

While hung, the server keeps emitting the following log lines, but no response is ever returned to the client:

```
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056803}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056823}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056843}
```

Maybe related to https://github.com/ollama/ollama/issues/1863

OS

Linux, Docker

GPU

Nvidia

CPU

AMD

Ollama version

0.1.32

GiteaMirror added the bug label 2026-04-22 06:03:10 -05:00
Author
Owner

@newink commented on GitHub (Apr 26, 2024):

Got the same behavior with a 4090, dockerized on Ubuntu 22.04, after a single (pretty big, though) request.

Author
Owner

@frederick-wang commented on GitHub (Apr 27, 2024):

Got the same behavior with A100 on Ubuntu 22.04. ollama version is 0.1.32.

Author
Owner

@jmorganca commented on GitHub (May 9, 2024):

Hi folks, this should be fixed in 0.1.33 – a generation limit was added to account for rare cases where the model generates tokens indefinitely. Note: more improvements are coming around this too.
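For anyone still on an affected version, a similar cap can be applied from the client via the `num_predict` request option (the standard Ollama option that bounds how many tokens a single generation may produce). The chat call below is shown commented out as a sketch; the model name and limit are placeholder values.

```python
# Sketch: bounding generation length client-side via Ollama's
# `num_predict` option (generation stops after at most N tokens).
options = {"num_predict": 512}

# The actual call would look roughly like this (requires the `ollama` package):
# import ollama
# response = ollama.chat(
#     model="mixtral",
#     messages=[{"role": "user", "content": prompt}],
#     options=options,
# )
```

This does not fix the underlying server bug, but it prevents a runaway generation from monopolizing a slot indefinitely.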


Reference: github-starred/ollama#28184