[GH-ISSUE #1863] Ollama stuck after few runs #1063

New Issue

@iplayfast commented on GitHub (Jan 9, 2024):

I think this is realted to https://github.com/jmorganca/ollama/issues/1691

@iplayfast commented on GitHub (Jan 9, 2024): I think this is realted to https://github.com/jmorganca/ollama/issues/1691

GiteaMirror commented

@IAMBUDE commented on GitHub (Jan 9, 2024):

I also experience this issue with 2x 3090 GPUs. The server just stops generating.

@IAMBUDE commented on GitHub (Jan 9, 2024): I also experience this issue with 2x 3090 GPUs. The server just stops generating.

GiteaMirror commented

@jadhvank commented on GitHub (Jan 10, 2024):

I updated the Ollama to version 0.1.19 and the stuck happened again in 5 min.
Removed the 0.1.19 and installed 0.1.16.
The stuck occurred after 6 hours (better!)

@jadhvank commented on GitHub (Jan 10, 2024): I updated the Ollama to version 0.1.19 and the stuck happened again in 5 min. Removed the 0.1.19 and installed 0.1.16. The stuck occurred after 6 hours (better!)

GiteaMirror commented

2026-04-12 10:48:18 -05:00

@EmanueleLenzi92 commented on GitHub (Jan 17, 2024):

I think I have the same problem. After a few runs, the ollama server crashes and stops to generate text. I'm using windows 11 (wsl ubuntu) and langchain. I have a rtx 4090 and I tried from 0.1.16 to 0.1.19, but all of them have this issue in my case.
instead, on a laptop with windows 10 and with an nvidia T500, I don't have this problem.

@EmanueleLenzi92 commented on GitHub (Jan 17, 2024): I think I have the same problem. After a few runs, the ollama server crashes and stops to generate text. I'm using windows 11 (wsl ubuntu) and langchain. I have a rtx 4090 and I tried from 0.1.16 to 0.1.19, but all of them have this issue in my case. instead, on a laptop with windows 10 and with an nvidia T500, I don't have this problem.

GiteaMirror commented

@hml-github commented on GitHub (Jan 18, 2024):

me too, same problem, stop generation after random time.

@hml-github commented on GitHub (Jan 18, 2024): me too, same problem, stop generation after random time.

GiteaMirror commented

2026-04-12 10:48:18 -05:00

@amirdeljouyi commented on GitHub (Jan 24, 2024):

Similarly, it halts after approximately 100 iterations.

@amirdeljouyi commented on GitHub (Jan 24, 2024): Similarly, it halts after approximately 100 iterations.

GiteaMirror commented

2026-04-12 10:48:19 -05:00

@mchiang0610 commented on GitHub (Jan 27, 2024):

wanted to see if anyone is still running into this issue with ollama v0.1.22

@mchiang0610 commented on GitHub (Jan 27, 2024): wanted to see if anyone is still running into this issue with ollama v0.1.22

GiteaMirror commented

2026-04-12 10:48:19 -05:00

@EmanueleLenzi92 commented on GitHub (Feb 2, 2024):

wanted to see if anyone is still running into this issue with ollama v0.1.22

I confirm i still have this problem with 0.1.22

@EmanueleLenzi92 commented on GitHub (Feb 2, 2024): > wanted to see if anyone is still running into this issue with ollama v0.1.22 I confirm i still have this problem with 0.1.22

GiteaMirror commented

2026-04-12 10:48:19 -05:00

@julienlesbegueriesperso commented on GitHub (Feb 2, 2024):

wanted to see if anyone is still running into this issue with ollama v0.1.22

I confirm i still have this problem with 0.1.22

I confirm also (on MacBook Pro 2,6 GHz Intel Core i7 and on a cpu-only server)

@julienlesbegueriesperso commented on GitHub (Feb 2, 2024): > > wanted to see if anyone is still running into this issue with ollama v0.1.22 > > I confirm i still have this problem with 0.1.22 I confirm also (on MacBook Pro 2,6 GHz Intel Core i7 and on a cpu-only server)

GiteaMirror commented

2026-04-12 10:48:20 -05:00

@Simaky commented on GitHub (Feb 2, 2024):

I could confirm that issue with 0.1.23 (on WSL)
I ran the script with 100 requests and saw in the logs that 6/10 requests were frozen and never received a response :(

@Simaky commented on GitHub (Feb 2, 2024): I could confirm that issue with 0.1.23 (on WSL) I ran the script with 100 requests and saw in the logs that 6/10 requests were frozen and never received a response :(

GiteaMirror commented

2026-04-12 10:48:20 -05:00

@svilupp commented on GitHub (Feb 8, 2024):

+1

I run a community leaderboard for Julia code generation and I've run 10s of thousands of samples in the past (with failures, but not unreasonable).
Recently, I've updated and haven't been able to run anything anymore... Same machine/setup

Behavior:

I updated to 0.1.2x (not sure) and I couldn't run more than a few samples of Qwen-72b FP16. It kept freezing, it would stop using GPU, etc
Updated to 0.1.23 -> cannot run more than 1-2 samples and it just hangs (need to kill process)
Tried 0.1.24 pre-release -> same as above
Tried 0.1.19 -> better, I could run c. 100 samples and then it froze
I'm now on 0.1.16 and it's been a few hundred samples and it's still going. So clearly the issue didn't exist back then!

Workload:

This benchmark: https://github.com/svilupp/Julia-LLM-Leaderboard
Different models, prompts, test cases, but never have any empty or unexpected inputs (the workload hasn't changed, it just doesn't run anymore). I'm even running models that ran fine previously
Using ollama via /api/chat endpoint

System:

Debian-based x86
4x RTX 4090

Ollama header:

2024/02/08 17:20:55 images.go:857: INFO total blobs: 140
2024/02/08 17:20:55 images.go:864: INFO total unused blobs removed: 0
2024/02/08 17:20:55 routes.go:950: INFO Listening on 127.0.0.1:11434 (version 0.1.22)
2024/02/08 17:20:55 payload_common.go:106: INFO Extracting dynamic libraries...
2024/02/08 17:20:57 payload_common.go:145: INFO Dynamic LLM libraries [rocm_v6 cpu rocm_v5 cuda_v11 cpu_avx2 cpu_avx]
2024/02/08 17:20:57 gpu.go:94: INFO Detecting GPU type
2024/02/08 17:20:57 gpu.go:236: INFO Searching for GPU management library libnvidia-ml.so
2024/02/08 17:20:57 gpu.go:282: INFO Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.86.10]
2024/02/08 17:20:57 gpu.go:99: INFO Nvidia GPU detected
2024/02/08 17:20:57 gpu.go:140: INFO CUDA Compute Capability detected: 8.9

@svilupp commented on GitHub (Feb 8, 2024): +1 I run a community leaderboard for Julia code generation and I've run 10s of thousands of samples in the past (with failures, but not unreasonable). Recently, I've updated and haven't been able to run anything anymore... Same machine/setup **Behavior:** - I updated to 0.1.2x (not sure) and I couldn't run more than a few samples of Qwen-72b FP16. It kept freezing, it would stop using GPU, etc - Updated to 0.1.23 -> cannot run more than 1-2 samples and it just hangs (need to kill process) - Tried 0.1.24 pre-release -> same as above - Tried 0.1.19 -> better, I could run c. 100 samples and then it froze - I'm now on 0.1.16 and it's been a few hundred samples and it's still going. So clearly the issue didn't exist back then! **Workload:** - This benchmark: https://github.com/svilupp/Julia-LLM-Leaderboard - Different models, prompts, test cases, but never have any empty or unexpected inputs (the workload hasn't changed, it just doesn't run anymore). I'm even running models that ran fine previously - Using ollama via /api/chat endpoint **System:** - Debian-based x86 - 4x RTX 4090 **Ollama header:** > 2024/02/08 17:20:55 images.go:857: INFO total blobs: 140 > 2024/02/08 17:20:55 images.go:864: INFO total unused blobs removed: 0 > 2024/02/08 17:20:55 routes.go:950: INFO Listening on 127.0.0.1:11434 (version 0.1.22) > 2024/02/08 17:20:55 payload_common.go:106: INFO Extracting dynamic libraries... > 2024/02/08 17:20:57 payload_common.go:145: INFO Dynamic LLM libraries [rocm_v6 cpu rocm_v5 cuda_v11 cpu_avx2 cpu_avx] > 2024/02/08 17:20:57 gpu.go:94: INFO Detecting GPU type > 2024/02/08 17:20:57 gpu.go:236: INFO Searching for GPU management library libnvidia-ml.so > 2024/02/08 17:20:57 gpu.go:282: INFO Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.86.10] > 2024/02/08 17:20:57 gpu.go:99: INFO Nvidia GPU detected > 2024/02/08 17:20:57 gpu.go:140: INFO CUDA Compute Capability detected: 8.9

GiteaMirror commented

@wac81 commented on GitHub (Feb 13, 2024):

Hi @jadhvank sorry you hit this, looking into it嗨，抱歉你碰到了这个，正在调查它

In the meantime an easy way to install 0.1.17 is同时安装 0.1.17 的简单方法是
curl https://ollama.ai/install.sh | sed 's#https://ollama.ai/download#https://github.com/jmorganca/ollama/releases/download/v0.1.17#' | sh

Could it have anything to do with GPU memory management?

My experience is that if you use a 12g gpu to load the llama13b model, the output will basically get stuck if it exceeds 200 tokens.

@wac81 commented on GitHub (Feb 13, 2024): > Hi @jadhvank sorry you hit this, looking into it嗨，抱歉你碰到了这个，正在调查它 > > In the meantime an easy way to install `0.1.17` is同时安装 `0.1.17` 的简单方法是 > > ``` > curl https://ollama.ai/install.sh | sed 's#https://ollama.ai/download#https://github.com/jmorganca/ollama/releases/download/v0.1.17#' | sh > ``` Could it have anything to do with GPU memory management? My experience is that if you use a 12g gpu to load the llama13b model, the output will basically get stuck if it exceeds 200 tokens.

GiteaMirror commented

@jmorganca commented on GitHub (Feb 20, 2024):

This should be fixed as of 0.1.24. Please let me know if that isn't the case, and we'll re-open this (and get it fixed once and for all 😊). Sorry about this!

@jmorganca commented on GitHub (Feb 20, 2024): This should be fixed as of 0.1.24. Please let me know if that isn't the case, and we'll re-open this (and get it fixed once and for all 😊). Sorry about this!

GiteaMirror commented

@StrikerRUS commented on GitHub (Feb 20, 2024):

@jmorganca Unfortunately, it isn't fixed in 0.1.25.

OS: Ubuntu 22.04.2 LTS
GPU: NVIDIA RTX A6000 (Driver Version: 530.41.03, CUDA Version: 12.1)
Model: Tested mixtral:8x7b-instruct-v0.1-q4_K_M, mixtral:8x7b-instruct-v0.1-q6_K, llama2:7b-chat-q4_0
Env: Official Docker

/api/generate and /api/chat hangs complitely while version or tags info works well.
Even docker compose restart doesn't help, only complete down + up helps.

Observed this behavior sometimes with 0.1.23, but 0.1.25 makes things even worse - hangs approximately every hour.

@StrikerRUS commented on GitHub (Feb 20, 2024): @jmorganca Unfortunately, it isn't fixed in 0.1.25. OS: Ubuntu 22.04.2 LTS GPU: NVIDIA RTX A6000 (Driver Version: 530.41.03, CUDA Version: 12.1) Model: Tested `mixtral:8x7b-instruct-v0.1-q4_K_M`, `mixtral:8x7b-instruct-v0.1-q6_K`, `llama2:7b-chat-q4_0` Env: Official Docker `/api/generate` and `/api/chat` hangs complitely while version or tags info works well. Even `docker compose restart` doesn't help, only complete `down + up` helps. Observed this behavior sometimes with 0.1.23, but 0.1.25 makes things even worse - hangs approximately every hour.

GiteaMirror commented

2026-04-12 10:48:22 -05:00

@calebdel commented on GitHub (Feb 21, 2024):

@jmorganca, Likewise still seeing this issue after a small number of iterations on v0.1.25

@calebdel commented on GitHub (Feb 21, 2024): @jmorganca, Likewise still seeing this issue after a small number of iterations on v0.1.25

GiteaMirror commented

@EmanueleLenzi92 commented on GitHub (Feb 22, 2024):

I think I have the same problem. After a few runs, the ollama server crashes and stops to generate text. I'm using windows 11 (wsl ubuntu) and langchain. I have a rtx 4090 and I tried from 0.1.16 to 0.1.19, but all of them have this issue in my case. instead, on a laptop with windows 10 and with an nvidia T500, I don't have this problem.

I confirm this problem with 0.1.25 and 0.1.26

@EmanueleLenzi92 commented on GitHub (Feb 22, 2024): > I think I have the same problem. After a few runs, the ollama server crashes and stops to generate text. I'm using windows 11 (wsl ubuntu) and langchain. I have a rtx 4090 and I tried from 0.1.16 to 0.1.19, but all of them have this issue in my case. instead, on a laptop with windows 10 and with an nvidia T500, I don't have this problem. I confirm this problem with 0.1.25 and 0.1.26

GiteaMirror commented

2026-04-12 10:48:22 -05:00

@StrikerRUS commented on GitHub (Feb 23, 2024):

@jmorganca Can you please reopen this issue?

@StrikerRUS commented on GitHub (Feb 23, 2024): @jmorganca Can you please reopen this issue?

GiteaMirror commented

2026-04-12 10:48:22 -05:00

@BEpresent commented on GitHub (Feb 23, 2024):

Same here, issue still persists on fresh install (calling multiple times in a loop).

@BEpresent commented on GitHub (Feb 23, 2024): Same here, issue still persists on fresh install (calling multiple times in a loop).

GiteaMirror commented

2026-04-12 10:48:23 -05:00

@ArjonBu commented on GitHub (Feb 24, 2024):

I am seeing this with 0.1.27 running on docker on linux. Docker has a limit of 8GB of RAM but the container is using only 1.

The container just hangs and shows nothing in logs. I am using open-webui as a frontend.

@ArjonBu commented on GitHub (Feb 24, 2024): I am seeing this with 0.1.27 running on docker on linux. Docker has a limit of 8GB of RAM but the container is using only 1. The container just hangs and shows nothing in logs. I am using open-webui as a frontend.

GiteaMirror commented

2026-04-12 10:48:23 -05:00

@julienlesbegueriesperso commented on GitHub (Feb 25, 2024):

I confirm alors on 0.1.27 on Mac OS X, Fedora with GPU (RTX), and Ubuntu (without GPU). In a fastapi + langchain env with 2 endpoints invoking 2 different ollama models , after I succeed in receiving responses from the first endpoint, I'm stuck when I try the 2nd endpoint. I have to restart the ollama service to see my response.

@julienlesbegueriesperso commented on GitHub (Feb 25, 2024): I confirm alors on 0.1.27 on Mac OS X, Fedora with GPU (RTX), and Ubuntu (without GPU). In a fastapi + langchain env with 2 endpoints invoking 2 different ollama models , after I succeed in receiving responses from the first endpoint, I'm stuck when I try the 2nd endpoint. I have to restart the ollama service to see my response.

GiteaMirror commented

2026-04-12 10:48:24 -05:00

@ytlai1985 commented on GitHub (Feb 27, 2024):

I confirm that this problem occurs with versions 0.1.24 and 0.1.27. After adding a prompt about the output limitation, it seems to be resolved.
Does that mean no [EOS] token has been generated? Using the 'STOP' options will also resolve this problem, but sometimes it may not achieve the ideal result.

OS: Ubuntu 22.04.2 LTS
GPU: NVIDIA L4 (Driver Version: 535.154.05, CUDA Version: 12.2)
Model: Mixtral8x7b-instruct-v0.1-q5_K_M

For example
Limitation prompt

    text_to_gpt = (
        f'[INST] You are a helpful assistant. Your task is to read the following context.'
        f'You will respond with a JSON object containing the entire sentence only if it contains abnormal words,'
        f' along with the confidence score. '
        f' Display only five objects with a high confidence score (greater than 0.6).'    #  Adding limitation
        f'Abnormal words are defined as 'error', 'unknown', 'fail', 'alert', etc. Do not provide explanations.'
        f'context """\n{input_text}\n"""\n'
    )

Use options

        response = self.client.chat(
            model=self.model,
            messages=[
                {
                    'role': 'user',
                    'content': input_text,
                },
            ],
            options={
                'top_k': 10,
                'temperature': 0.6,
                'stop': ['\n']
            }
        )

@ytlai1985 commented on GitHub (Feb 27, 2024): I confirm that this problem occurs with versions 0.1.24 and 0.1.27. After adding a prompt about the output limitation, it seems to be resolved. Does that mean no [EOS] token has been generated? Using the 'STOP' options will also resolve this problem, but sometimes it may not achieve the ideal result. OS: Ubuntu 22.04.2 LTS GPU: NVIDIA L4 (Driver Version: 535.154.05, CUDA Version: 12.2) Model: Mixtral8x7b-instruct-v0.1-q5_K_M ![image](https://github.com/ollama/ollama/assets/105573237/0fdbf3a7-7dd7-4edf-9831-6d3cba76c4d3) For example Limitation prompt ``` python text_to_gpt = ( f'[INST] You are a helpful assistant. Your task is to read the following context.' f'You will respond with a JSON object containing the entire sentence only if it contains abnormal words,' f' along with the confidence score. ' f' Display only five objects with a high confidence score (greater than 0.6).' # Adding limitation f'Abnormal words are defined as 'error', 'unknown', 'fail', 'alert', etc. Do not provide explanations.' f'context """\n{input_text}\n"""\n' ) ``` Use options ```python response = self.client.chat( model=self.model, messages=[ { 'role': 'user', 'content': input_text, }, ], options={ 'top_k': 10, 'temperature': 0.6, 'stop': ['\n'] } ) ```

GiteaMirror commented

2026-04-12 10:48:24 -05:00

@dhiltgen commented on GitHub (Feb 28, 2024):

Has anyone come up with a minimal repro with curl or equivalent? I'll try to repro and get to the bottom of this.

@dhiltgen commented on GitHub (Feb 28, 2024): Has anyone come up with a minimal repro with curl or equivalent? I'll try to repro and get to the bottom of this.

GiteaMirror commented

2026-04-12 10:48:25 -05:00

@wizardsd commented on GitHub (Mar 1, 2024):

I confirm this problem with 0.1.27 on Windows 10 without WSL. May be related with format=json and stream=false?

@wizardsd commented on GitHub (Mar 1, 2024): I confirm this problem with 0.1.27 on Windows 10 without WSL. May be related with format=json and stream=false?

GiteaMirror commented

2026-04-12 10:48:26 -05:00

@koleshjr commented on GitHub (Mar 4, 2024):

Could someone help us. This issue still persists :
I have updated to the latest release version: v0.1.28 and it still get's stuck after around 200 iterations on google collab free tier t4

@koleshjr commented on GitHub (Mar 4, 2024): Could someone help us. This issue still persists : I have updated to the latest release version: v0.1.28 and it still get's stuck after around 200 iterations on google collab free tier t4

GiteaMirror commented

2026-04-12 10:48:26 -05:00

@eusthace811 commented on GitHub (Mar 5, 2024):

In my case, it becomes unresponsive right after the initial interaction.

1 gpu Nvidia A4500, 13b q4 K M model

@eusthace811 commented on GitHub (Mar 5, 2024): In my case, it becomes unresponsive right after the initial interaction. 1 gpu Nvidia A4500, 13b q4 K M model

GiteaMirror commented

2026-04-12 10:48:27 -05:00

@giedriusrflt commented on GitHub (Mar 6, 2024):

Gets stuck also with:

$ ollama -v
ollama version is 0.1.28

@giedriusrflt commented on GitHub (Mar 6, 2024): Gets stuck also with: ```bash $ ollama -v ollama version is 0.1.28 ```

GiteaMirror commented

2026-04-12 10:48:27 -05:00

@jonomillin commented on GitHub (Mar 6, 2024):

I'm getting the same thing with 0.1.26, 0.1.27, 0.1.28 on an M2 Max (64gb ram)
This is both in the cli ollama run llava and via python APIs (chat and generate). It will work fine on one or two images, then stall out. There is no crash, it just stops streaming new tokens and hangs.

Server logs are as follows for a sample run via Python:

time=2024-03-07T07:46:10.314+09:00 level=INFO source=images.go:710 msg="total blobs: 14"
time=2024-03-07T07:46:10.315+09:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-07T07:46:10.316+09:00 level=INFO source=routes.go:1021 msg="Listening on 127.0.0.1:11434 (version 0.1.28)"
time=2024-03-07T07:46:10.316+09:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-07T07:46:10.338+09:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [metal]"
[GIN] 2024/03/07 - 07:46:10 | 200 |      45.458µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/03/07 - 07:46:10 | 200 |       578.5µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/03/07 - 07:46:10 | 200 |     373.083µs |       127.0.0.1 | POST     "/api/show"
time=2024-03-07T07:46:15.526+09:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/metal/libext_server.dylib"
time=2024-03-07T07:46:15.526+09:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
loading library /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/metal/libext_server.dylib
{"function":"load_model","level":"INFO","line":380,"msg":"Multi Modal Mode Enabled","tid":"0x171ef7000","timestamp":1709765175}
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423
ggml_metal_init: loading '/var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   595.50 MiB, (  596.44 / 49152.00)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    32.89 MiB, (  629.33 / 49152.00)
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /Users/jono/.ollama/models/blobs/sha256:170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = liuhaotian
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = liuhaotian
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  3847.58 MiB, ( 4476.91 / 49152.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:      Metal buffer size =  3847.57 MiB
................................
....................................
..............................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423
ggml_metal_init: loading '/var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   256.00 MiB, ( 4733.59 / 49152.00)
llama_kv_cache_init:      Metal KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    13.02 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   164.02 MiB, ( 4897.61 / 49152.00)
llama_new_context_with_model:      Metal compute buffer size =   164.00 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.00 MiB
llama_new_context_with_model: graph splits (measure): 2
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    377
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 377 tensors from /Users/jono/.ollama/models/blobs/sha256:72d6f08a42f656d36b356dbe0920675899a99ce21192fd66266fb7d82ed07539
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                        clip.projector_type str              = mlp
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  18:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  235 tensors
clip_model_load: - type  f16:  142 tensors
clip_model_load: CLIP using Metal backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     595.49 MB
clip_model_load: metadata size:  0.14 MB
clip_model_load: params backend buffer size =  595.49 MB (377 tensors)
clip_model_load: compute allocated memory: 32.89 MB
{"function":"initialize","level":"INFO","line":433,"msg":"initializing slots","n_slots":1,"tid":"0x171ef7000","timestamp":1709765176}
{"function":"initialize","level":"INFO","line":445,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"0x171ef7000","timestamp":1709765176}
time=2024-03-07T07:46:16.208+09:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
{"function":"update_slots","level":"INFO","line":1565,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"0x17349f000","timestamp":1709765176}
[GIN] 2024/03/07 - 07:46:16 | 200 |  5.709566167s |       127.0.0.1 | POST     "/api/chat"
update check failed - TypeError: fetch failed
time=2024-03-07T07:46:41.119+09:00 level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
{"function":"launch_slot_with_data","level":"INFO","line":826,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"0x17349f000","timestamp":1709765201}
{"function":"update_slots","level":"INFO","line":1825,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"0x17349f000","timestamp":1709765201}
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in   223.26 ms by CLIP (    0.39 ms per image patch)
{"function":"print_timings","level":"INFO","line":264,"msg":"prompt eval time     =    5730.16 ms /     1 tokens ( 5730.16 ms per token,     0.17 tokens per second)","n_prompt_tokens_processed":1,"n_tokens_second":0.17451528815004544,"slot_id":0,"t_prompt_processing":5730.157,"t_token":5730.157,"task_id":0,"tid":"0x17349f000","timestamp":1709765223}
{"function":"print_timings","level":"INFO","line":278,"msg":"generation eval time =   16943.76 ms /   411 runs   (   41.23 ms per token,    24.26 tokens per second)","n_decoded":411,"n_tokens_second":24.25672181205148,"slot_id":0,"t_token":41.225686131386865,"t_token_generation":16943.757,"task_id":0,"tid":"0x17349f000","timestamp":1709765223}
{"function":"print_timings","level":"INFO","line":287,"msg":"          total time =   22673.91 ms","slot_id":0,"t_prompt_processing":5730.157,"t_token_generation":16943.757,"t_total":22673.914,"task_id":0,"tid":"0x17349f000","timestamp":1709765223}
{"function":"update_slots","level":"INFO","line":1635,"msg":"slot released","n_cache_tokens":412,"n_ctx":2048,"n_past":1971,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"0x17349f000","timestamp":1709765223,"truncated":false}
[GIN] 2024/03/07 - 07:47:03 | 200 | 22.698469125s |       127.0.0.1 | POST     "/api/chat"
time=2024-03-07T07:47:03.825+09:00 level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
{"function":"launch_slot_with_data","level":"INFO","line":826,"msg":"slot is processing task","slot_id":0,"task_id":414,"tid":"0x17349f000","timestamp":1709765223}
{"function":"update_slots","level":"INFO","line":1825,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":414,"tid":"0x17349f000","timestamp":1709765223}

@jonomillin commented on GitHub (Mar 6, 2024): I'm getting the same thing with 0.1.26, 0.1.27, 0.1.28 on an M2 Max (64gb ram) This is both in the cli `ollama run llava` and via python APIs (chat and generate). It will work fine on one or two images, then stall out. There is no crash, it just stops streaming new tokens and hangs. Server logs are as follows for a sample run via Python: ``` time=2024-03-07T07:46:10.314+09:00 level=INFO source=images.go:710 msg="total blobs: 14" time=2024-03-07T07:46:10.315+09:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0" time=2024-03-07T07:46:10.316+09:00 level=INFO source=routes.go:1021 msg="Listening on 127.0.0.1:11434 (version 0.1.28)" time=2024-03-07T07:46:10.316+09:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..." time=2024-03-07T07:46:10.338+09:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [metal]" [GIN] 2024/03/07 - 07:46:10 | 200 | 45.458µs | 127.0.0.1 | HEAD "/" [GIN] 2024/03/07 - 07:46:10 | 200 | 578.5µs | 127.0.0.1 | POST "/api/show" [GIN] 2024/03/07 - 07:46:10 | 200 | 373.083µs | 127.0.0.1 | POST "/api/show" time=2024-03-07T07:46:15.526+09:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/metal/libext_server.dylib" time=2024-03-07T07:46:15.526+09:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" loading library /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/metal/libext_server.dylib {"function":"load_model","level":"INFO","line":380,"msg":"Multi Modal Mode Enabled","tid":"0x171ef7000","timestamp":1709765175} ggml_metal_init: allocating ggml_metal_init: found device: Apple M2 Max ggml_metal_init: picking default device: Apple M2 Max ggml_metal_init: default.metallib not found, loading from source ggml_metal_init: GGML_METAL_PATH_RESOURCES = /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423 ggml_metal_init: loading '/var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/ggml-metal.metal' ggml_metal_init: GPU name: Apple M2 Max ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008) ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_init: simdgroup reduction support = true ggml_metal_init: simdgroup matrix mul. support = true ggml_metal_init: hasUnifiedMemory = true ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 595.50 MiB, ( 596.44 / 49152.00) key clip.vision.image_grid_pinpoints not found in file key clip.vision.mm_patch_merge_type not found in file key clip.vision.image_crop_resolution not found in file ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 32.89 MiB, ( 629.33 / 49152.00) llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /Users/jono/.ollama/models/blobs/sha256:170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = liuhaotian llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 3.83 GiB (4.54 BPW) llm_load_print_meta: general.name = liuhaotian llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 2 '</s>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: PAD token = 0 '<unk>' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.22 MiB ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 3847.58 MiB, ( 4476.91 / 49152.00) llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 70.31 MiB llm_load_tensors: Metal buffer size = 3847.57 MiB ................................ .................................... .............................. llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 ggml_metal_init: allocating ggml_metal_init: found device: Apple M2 Max ggml_metal_init: picking default device: Apple M2 Max ggml_metal_init: default.metallib not found, loading from source ggml_metal_init: GGML_METAL_PATH_RESOURCES = /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423 ggml_metal_init: loading '/var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/ggml-metal.metal' ggml_metal_init: GPU name: Apple M2 Max ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008) ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_init: simdgroup reduction support = true ggml_metal_init: simdgroup matrix mul. support = true ggml_metal_init: hasUnifiedMemory = true ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 256.00 MiB, ( 4733.59 / 49152.00) llama_kv_cache_init: Metal KV buffer size = 256.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CPU input buffer size = 13.02 MiB ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 164.02 MiB, ( 4897.61 / 49152.00) llama_new_context_with_model: Metal compute buffer size = 164.00 MiB llama_new_context_with_model: CPU compute buffer size = 8.00 MiB llama_new_context_with_model: graph splits (measure): 2 clip_model_load: model name: openai/clip-vit-large-patch14-336 clip_model_load: description: image encoder for LLaVA clip_model_load: GGUF version: 3 clip_model_load: alignment: 32 clip_model_load: n_tensors: 377 clip_model_load: n_kv: 19 clip_model_load: ftype: f16 clip_model_load: loaded meta data with 19 key-value pairs and 377 tensors from /Users/jono/.ollama/models/blobs/sha256:72d6f08a42f656d36b356dbe0920675899a99ce21192fd66266fb7d82ed07539 clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output. clip_model_load: - kv 0: general.architecture str = clip clip_model_load: - kv 1: clip.has_text_encoder bool = false clip_model_load: - kv 2: clip.has_vision_encoder bool = true clip_model_load: - kv 3: clip.has_llava_projector bool = true clip_model_load: - kv 4: general.file_type u32 = 1 clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336 clip_model_load: - kv 6: general.description str = image encoder for LLaVA clip_model_load: - kv 7: clip.projector_type str = mlp clip_model_load: - kv 8: clip.vision.image_size u32 = 336 clip_model_load: - kv 9: clip.vision.patch_size u32 = 14 clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1024 clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4096 clip_model_load: - kv 12: clip.vision.projection_dim u32 = 768 clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16 clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010 clip_model_load: - kv 15: clip.vision.block_count u32 = 23 clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211] clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777] clip_model_load: - kv 18: clip.use_gelu bool = false clip_model_load: - type f32: 235 tensors clip_model_load: - type f16: 142 tensors clip_model_load: CLIP using Metal backend clip_model_load: text_encoder: 0 clip_model_load: vision_encoder: 1 clip_model_load: llava_projector: 1 clip_model_load: model size: 595.49 MB clip_model_load: metadata size: 0.14 MB clip_model_load: params backend buffer size = 595.49 MB (377 tensors) clip_model_load: compute allocated memory: 32.89 MB {"function":"initialize","level":"INFO","line":433,"msg":"initializing slots","n_slots":1,"tid":"0x171ef7000","timestamp":1709765176} {"function":"initialize","level":"INFO","line":445,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"0x171ef7000","timestamp":1709765176} time=2024-03-07T07:46:16.208+09:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop" {"function":"update_slots","level":"INFO","line":1565,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"0x17349f000","timestamp":1709765176} [GIN] 2024/03/07 - 07:46:16 | 200 | 5.709566167s | 127.0.0.1 | POST "/api/chat" update check failed - TypeError: fetch failed time=2024-03-07T07:46:41.119+09:00 level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images" {"function":"launch_slot_with_data","level":"INFO","line":826,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"0x17349f000","timestamp":1709765201} {"function":"update_slots","level":"INFO","line":1825,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"0x17349f000","timestamp":1709765201} encode_image_with_clip: image embedding created: 576 tokens encode_image_with_clip: image encoded in 223.26 ms by CLIP ( 0.39 ms per image patch) {"function":"print_timings","level":"INFO","line":264,"msg":"prompt eval time = 5730.16 ms / 1 tokens ( 5730.16 ms per token, 0.17 tokens per second)","n_prompt_tokens_processed":1,"n_tokens_second":0.17451528815004544,"slot_id":0,"t_prompt_processing":5730.157,"t_token":5730.157,"task_id":0,"tid":"0x17349f000","timestamp":1709765223} {"function":"print_timings","level":"INFO","line":278,"msg":"generation eval time = 16943.76 ms / 411 runs ( 41.23 ms per token, 24.26 tokens per second)","n_decoded":411,"n_tokens_second":24.25672181205148,"slot_id":0,"t_token":41.225686131386865,"t_token_generation":16943.757,"task_id":0,"tid":"0x17349f000","timestamp":1709765223} {"function":"print_timings","level":"INFO","line":287,"msg":" total time = 22673.91 ms","slot_id":0,"t_prompt_processing":5730.157,"t_token_generation":16943.757,"t_total":22673.914,"task_id":0,"tid":"0x17349f000","timestamp":1709765223} {"function":"update_slots","level":"INFO","line":1635,"msg":"slot released","n_cache_tokens":412,"n_ctx":2048,"n_past":1971,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"0x17349f000","timestamp":1709765223,"truncated":false} [GIN] 2024/03/07 - 07:47:03 | 200 | 22.698469125s | 127.0.0.1 | POST "/api/chat" time=2024-03-07T07:47:03.825+09:00 level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images" {"function":"launch_slot_with_data","level":"INFO","line":826,"msg":"slot is processing task","slot_id":0,"task_id":414,"tid":"0x17349f000","timestamp":1709765223} {"function":"update_slots","level":"INFO","line":1825,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":414,"tid":"0x17349f000","timestamp":1709765223} ```

GiteaMirror commented

2026-04-12 10:48:28 -05:00

@jithinmukundan commented on GitHub (Mar 7, 2024):

I am facing the same issue after running it on gpu. I had no issues previously when running it only on cpu. Using 0.1.28 and llama index. Eagerly waiting for a solution.

@jithinmukundan commented on GitHub (Mar 7, 2024): I am facing the same issue after running it on gpu. I had no issues previously when running it only on cpu. Using 0.1.28 and llama index. Eagerly waiting for a solution.

GiteaMirror commented

2026-04-12 10:48:29 -05:00

@urinieto commented on GitHub (Mar 9, 2024):

Same issue here :(

I'm on version 0.1.28. It seems to stop working after ~100 to ~3000 queries in my Linux setup.

@urinieto commented on GitHub (Mar 9, 2024): Same issue here :( I'm on version 0.1.28. It seems to stop working after ~100 to ~3000 queries in my Linux setup.

GiteaMirror commented

2026-04-12 10:48:29 -05:00

@ckehagioglou commented on GitHub (Mar 10, 2024):

Has anyone come up with a minimal repro with curl or equivalent? I'll try to repro and get to the bottom of this.

Regretfully, haven't managed to do so. Nevertheless, I went through the logs and noticed that when Ollama hangs, instead of the normal functions sequence:
launch_slot_with_data: slot processing task -> update_slots: slot progression, kv cache -> print timings: prompt, generation -> update slots: slot release

it goes through the following:
launch_slot_with_data: slot processing task -> update_slots: slot progression, kv cache -> update_slots: slot context shift

The last function executes infinitely until I stop the server and relaunch it. So, might be related to another issue I found (sorry I haven't pinpointed the number) related to infinite context shifting.

Hope the above provides a bit of assistance.

@ckehagioglou commented on GitHub (Mar 10, 2024): > Has anyone come up with a minimal repro with curl or equivalent? I'll try to repro and get to the bottom of this. Regretfully, haven't managed to do so. Nevertheless, I went through the logs and noticed that when Ollama hangs, instead of the normal functions sequence: launch_slot_with_data: slot processing task -> update_slots: slot progression, kv cache -> print timings: prompt, generation -> update slots: slot release it goes through the following: launch_slot_with_data: slot processing task -> update_slots: slot progression, kv cache -> update_slots: slot context shift The last function executes infinitely until I stop the server and relaunch it. So, might be related to another issue I found (sorry I haven't pinpointed the number) related to infinite context shifting. Hope the above provides a bit of assistance.

GiteaMirror commented

2026-04-12 10:48:30 -05:00

@harmanpreet93 commented on GitHub (Mar 14, 2024):

Facing a similar issue inside docker on Ubuntu 18.04 with ollama version 0.1.28 on Quadro RTX 5000.

@harmanpreet93 commented on GitHub (Mar 14, 2024): Facing a similar issue inside docker on Ubuntu 18.04 with `ollama version 0.1.28` on Quadro RTX 5000.

GiteaMirror commented

2026-04-12 10:48:30 -05:00

@syrom commented on GitHub (Mar 14, 2024):

Same issue. Running ollama 0.1.28 on M1 Max.
My observation:
I work on a large number of text chunks as input for a RAG algorithms - and the task for the LLM (mixtral in my case) is to extract keywords and concepts from the chunks. The document is rather large; so it makes quite a diffference if I set the character count for the text split to produce chunks of a length of 1.000 or 2.000 characters. These chunks are served to Mixtral as USER_PROMPT - and the SYSTEM_PROMPT by itself is also rather long.
Now the key observation: the failure seems to be functionally dependent on the length of the overall prompt.
If I set text-split to 2.000 characters, the overall prompt lenghth is much longer - and the failure occurs much quicker (5-10 generations) than if the text-split is set to 1.000 characters (around 15-20 generations). Unfortunately, the algorithsm ought to work its way thru > 400 to 800 chunks.... which it doesn't.
Long story short: the occurence of the bug seems to be a function of the number of tokens being served to the LLM thru Ollama.

@syrom commented on GitHub (Mar 14, 2024): Same issue. Running ollama 0.1.28 on M1 Max. My observation: I work on a large number of text chunks as input for a RAG algorithms - and the task for the LLM (mixtral in my case) is to extract keywords and concepts from the chunks. The document is rather large; so it makes quite a diffference if I set the character count for the text split to produce chunks of a length of 1.000 or 2.000 characters. These chunks are served to Mixtral as USER_PROMPT - and the SYSTEM_PROMPT by itself is also rather long. Now the key observation: the failure seems to be functionally dependent on the length of the overall prompt. If I set text-split to 2.000 characters, the overall prompt lenghth is much longer - and the failure occurs much quicker (5-10 generations) than if the text-split is set to 1.000 characters (around 15-20 generations). Unfortunately, the algorithsm ought to work its way thru > 400 to 800 chunks.... which it doesn't. Long story short: the occurence of the bug seems to be a function of the number of tokens being served to the LLM thru Ollama.

GiteaMirror commented

2026-04-12 10:48:30 -05:00

@niyogrv commented on GitHub (Mar 20, 2024):

I'm observing this issue in 0.1.28 on Ubuntu 22.04 with a 3060(Driver: 535.161.07, CUDA: 12.2) and 16GB RAM running TheBloke's Q6 Mistral Instruct v0.2 GGUF.
I'm encountering this only when I'm setting "format"="json". I am using the model for a classification task and only got through 5 queries before it hung up and I had to restart ollama. I was able to reproduce this consistently and it always failed at the 6th query

I reran it, this time without the "format"="json" param, and I am 4k+ requests in without a crash

UPDATE:
It crashed at around 5.7k requests. So, while the json format enforcement seems to accelerate the issue, it still seems to happen if you're constantly bombarding the model with requests. Hopefully, this gets fixed soon :(

[GIN] 2024/03/20 - 10:19:18 | 200 |  2.654764194s | 2406:8800:80:b281:35bb:b529:139e:a792 | POST     "/api/generate"
time=2024-03-20T12:09:40.398+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-20T12:09:40.398+05:30 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-20T12:09:40.398+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-20T12:09:40.398+05:30 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-20T12:09:40.398+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library /tmp/ollama1233141962/cuda_v11/libext_server.so
time=2024-03-20T12:09:40.399+05:30 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama1233141962/cuda_v11/libext_server.so"
time=2024-03-20T12:09:40.399+05:30 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /home/pcname/.ollama/models/blobs/sha256:a4643671c92f47eb6027d0eff50b9875562e8e172128a4b10b2be250bb4264de (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 18
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 5.53 GiB (6.56 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   102.54 MiB
llm_load_tensors:      CUDA0 buffer size =  5563.55 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    13.02 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   164.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     8.00 MiB
llama_new_context_with_model: graph splits (measure): 2
{"function":"initialize","level":"INFO","line":433,"msg":"initializing slots","n_slots":1,"tid":"125807536088640","timestamp":1710916782}
{"function":"initialize","level":"INFO","line":442,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"125807536088640","timestamp":1710916782}
time=2024-03-20T12:09:42.092+05:30 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
{"function":"update_slots","level":"INFO","line":1565,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"125806336538176","timestamp":1710916782}
{"function":"launch_slot_with_data","level":"INFO","line":823,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916782}
{"function":"update_slots","level":"INFO","line":1796,"msg":"slot progression","n_past":0,"n_prompt_tokens_processed":507,"slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916782}
{"function":"update_slots","level":"INFO","line":1821,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916782}
{"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =     696.16 ms /   507 tokens (    1.37 ms per token,   728.28 tokens per second)","n_prompt_tokens_processed":507,"n_tokens_second":728.2839934095325,"slot_id":0,"t_prompt_processing":696.157,"t_token":1.3730907297830375,"task_id":0,"tid":"125806336538176","timestamp":1710916784}
{"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =    1756.20 ms /    77 runs   (   22.81 ms per token,    43.84 tokens per second)","n_decoded":77,"n_tokens_second":43.8446396511561,"slot_id":0,"t_token":22.807805194805194,"t_token_generation":1756.201,"task_id":0,"tid":"125806336538176","timestamp":1710916784}
{"function":"print_timings","level":"INFO","line":281,"msg":"          total time =    2452.36 ms","slot_id":0,"t_prompt_processing":696.157,"t_token_generation":1756.201,"t_total":2452.358,"task_id":0,"tid":"125806336538176","timestamp":1710916784}
{"function":"update_slots","level":"INFO","line":1627,"msg":"slot released","n_cache_tokens":584,"n_ctx":2048,"n_past":583,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916784,"truncated":false}
[GIN] 2024/03/20 - 12:09:44 | 200 |  4.410814509s |   116.68.72.166 | POST     "/api/generate"

@niyogrv commented on GitHub (Mar 20, 2024): I'm observing this issue in 0.1.28 on Ubuntu 22.04 with a 3060(Driver: 535.161.07, CUDA: 12.2) and 16GB RAM running TheBloke's Q6 Mistral Instruct v0.2 GGUF. I'm encountering this only when I'm setting "format"="json". I am using the model for a classification task and only got through 5 queries before it hung up and I had to restart ollama. I was able to reproduce this consistently and it always failed at the 6th query I reran it, this time without the "format"="json" param, and I am 4k+ requests in without a crash UPDATE: It crashed at around 5.7k requests. So, while the json format enforcement seems to accelerate the issue, it still seems to happen if you're constantly bombarding the model with requests. Hopefully, this gets fixed soon :( ``` [GIN] 2024/03/20 - 10:19:18 | 200 | 2.654764194s | 2406:8800:80:b281:35bb:b529:139e:a792 | POST "/api/generate" time=2024-03-20T12:09:40.398+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-20T12:09:40.398+05:30 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6" time=2024-03-20T12:09:40.398+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-20T12:09:40.398+05:30 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6" time=2024-03-20T12:09:40.398+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" loading library /tmp/ollama1233141962/cuda_v11/libext_server.so time=2024-03-20T12:09:40.399+05:30 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama1233141962/cuda_v11/libext_server.so" time=2024-03-20T12:09:40.399+05:30 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /home/pcname/.ollama/models/blobs/sha256:a4643671c92f47eb6027d0eff50b9875562e8e172128a4b10b2be250bb4264de (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2 llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 18 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q6_K: 226 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q6_K llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 5.53 GiB (6.56 BPW) llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2 llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 2 '</s>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: PAD token = 0 '<unk>' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.22 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 102.54 MiB llm_load_tensors: CUDA0 buffer size = 5563.55 MiB ................................................................................................... llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes ggml_init_cublas: CUDA_USE_TENSOR_CORES: no ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 164.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB llama_new_context_with_model: graph splits (measure): 2 {"function":"initialize","level":"INFO","line":433,"msg":"initializing slots","n_slots":1,"tid":"125807536088640","timestamp":1710916782} {"function":"initialize","level":"INFO","line":442,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"125807536088640","timestamp":1710916782} time=2024-03-20T12:09:42.092+05:30 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop" {"function":"update_slots","level":"INFO","line":1565,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"125806336538176","timestamp":1710916782} {"function":"launch_slot_with_data","level":"INFO","line":823,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916782} {"function":"update_slots","level":"INFO","line":1796,"msg":"slot progression","n_past":0,"n_prompt_tokens_processed":507,"slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916782} {"function":"update_slots","level":"INFO","line":1821,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916782} {"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time = 696.16 ms / 507 tokens ( 1.37 ms per token, 728.28 tokens per second)","n_prompt_tokens_processed":507,"n_tokens_second":728.2839934095325,"slot_id":0,"t_prompt_processing":696.157,"t_token":1.3730907297830375,"task_id":0,"tid":"125806336538176","timestamp":1710916784} {"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time = 1756.20 ms / 77 runs ( 22.81 ms per token, 43.84 tokens per second)","n_decoded":77,"n_tokens_second":43.8446396511561,"slot_id":0,"t_token":22.807805194805194,"t_token_generation":1756.201,"task_id":0,"tid":"125806336538176","timestamp":1710916784} {"function":"print_timings","level":"INFO","line":281,"msg":" total time = 2452.36 ms","slot_id":0,"t_prompt_processing":696.157,"t_token_generation":1756.201,"t_total":2452.358,"task_id":0,"tid":"125806336538176","timestamp":1710916784} {"function":"update_slots","level":"INFO","line":1627,"msg":"slot released","n_cache_tokens":584,"n_ctx":2048,"n_past":583,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916784,"truncated":false} [GIN] 2024/03/20 - 12:09:44 | 200 | 4.410814509s | 116.68.72.166 | POST "/api/generate" ```

GiteaMirror commented

2026-04-12 10:48:31 -05:00

@dhiltgen commented on GitHub (Mar 20, 2024):

This will likely be resolved with #3218 but I'll leave this open until we can verify the health check logic is sufficient to catch this hang scenario.

@dhiltgen commented on GitHub (Mar 20, 2024): This will likely be resolved with #3218 but I'll leave this open until we can verify the health check logic is sufficient to catch this hang scenario.

GiteaMirror commented

2026-04-12 10:48:31 -05:00

@syrom commented on GitHub (Mar 31, 2024):

My first feedback after the last Ollama update: the situation has improved a lot, but has not gone away alltogether.
I tried it out on several text sizes - and now it works on longer texts, but still gets eventually stuck on very long texts.
Before the last update, I would not get more than 30 generations in a row, feeding the algorithms with text chunks of the size of 1.000 characters. Now, it works up to around 100 generations and slightly north of that.
But e.g. processing text consisting of 200 or more chunks still gets the process stuck eventually.

@syrom commented on GitHub (Mar 31, 2024): My first feedback after the last Ollama update: the situation has improved a lot, but has not gone away alltogether. I tried it out on several text sizes - and now it works on longer texts, but still gets eventually stuck on very long texts. Before the last update, I would not get more than 30 generations in a row, feeding the algorithms with text chunks of the size of 1.000 characters. Now, it works up to around 100 generations and slightly north of that. But e.g. processing text consisting of 200 or more chunks still gets the process stuck eventually.

GiteaMirror commented

2026-04-12 10:48:32 -05:00

@omani commented on GitHub (Apr 5, 2024):

I have to stop docker. rm the docker instance and run it again to solve this issue. I hope someone fixes this soon.

@omani commented on GitHub (Apr 5, 2024): I have to stop docker. rm the docker instance and run it again to solve this issue. I hope someone fixes this soon.

GiteaMirror commented

2026-04-12 10:48:32 -05:00

@omani commented on GitHub (Apr 5, 2024):

here is an example of my local ollama in docker hallucinating:

 docker exec -it ollama ollama run starcoder2
>>> write a simple hello world program in golang
 and build an image from it.

This is the command that will be used:
```sh
docker run -it --name hello-go-app --rm -v ${PWD}:/go/src/hello-go-app golang:1.7 go run /go/src/hello-go-app/*.go

## Building the image

The first step is to create a Docker file that will contain the commands for building and running our application. The Dockerfile for our app will have two sections. We will use COPY and ENTRYPOINT for
this. Copy will copy the contents of our app folder into /go/src/hello^C

this happens with almost all models after some time. sometimes within minutes of cancelling and restarting the model.

@omani commented on GitHub (Apr 5, 2024): here is an example of my local ollama in docker hallucinating: ``` docker exec -it ollama ollama run starcoder2 >>> write a simple hello world program in golang and build an image from it. This is the command that will be used: ```sh docker run -it --name hello-go-app --rm -v ${PWD}:/go/src/hello-go-app golang:1.7 go run /go/src/hello-go-app/*.go ## Building the image The first step is to create a Docker file that will contain the commands for building and running our application. The Dockerfile for our app will have two sections. We will use COPY and ENTRYPOINT for this. Copy will copy the contents of our app folder into /go/src/hello^C ``` this happens with almost all models after some time. sometimes within minutes of cancelling and restarting the model.

GiteaMirror commented

2026-04-12 10:48:32 -05:00

@omani commented on GitHub (Apr 5, 2024):

what is this? why is this happening with all my models? does it have anything to do with ollama?

 docker exec -it ollama ollama run starcoder2
>>> write a simple hello world program in golang.
 1.

Go is a compiled language. You can't compile Go programs with gcc or
clang on Linux. So the first thing you need to do is install a C compiler.
In this case, you will use GCC. If you don’t know how to do that, please consult
the previous post: Installing^C

>>> Send a message (/? for help)

@omani commented on GitHub (Apr 5, 2024): what is this? why is this happening with all my models? does it have anything to do with ollama? ``` docker exec -it ollama ollama run starcoder2 >>> write a simple hello world program in golang. 1. Go is a compiled language. You can't compile Go programs with gcc or clang on Linux. So the first thing you need to do is install a C compiler. In this case, you will use GCC. If you don’t know how to do that, please consult the previous post: Installing^C >>> Send a message (/? for help) ```

GiteaMirror commented

2026-04-12 10:48:33 -05:00

@traddo commented on GitHub (Apr 7, 2024):

I ran into the same snag when I was working on summarizing text. Tweaking the prompt words sorted it out for me. I added a bit in the prompt to make sure the summary stays between 100 and 200 words.

@traddo commented on GitHub (Apr 7, 2024): I ran into the same snag when I was working on summarizing text. Tweaking the prompt words sorted it out for me. I added a bit in the prompt to make sure the summary stays between 100 and 200 words.

GiteaMirror commented

2026-04-12 10:48:33 -05:00

@Mecil9 commented on GitHub (Apr 7, 2024):

the same issue!

my system:
apple m1 max 64G
At the initial run, everything works fine. When questions are asked continuously, the system will be stuck, the CPU usage will continue to increase, and the GPU will be reduced to 0 at the same time.
Once the CPU reaches 100%, ollama will stop working. I have tried many methods to no avail!

@Mecil9 commented on GitHub (Apr 7, 2024): the same issue! ![image](https://github.com/ollama/ollama/assets/2948534/38769b40-36fd-47ae-8d41-ecc801eaa92b) my system: apple m1 max 64G At the initial run, everything works fine. When questions are asked continuously, the system will be stuck, the CPU usage will continue to increase, and the GPU will be reduced to 0 at the same time. Once the CPU reaches 100%, ollama will stop working. I have tried many methods to no avail!

GiteaMirror commented

2026-04-12 10:48:33 -05:00

@WithAnOrchid commented on GitHub (Apr 8, 2024):

I ran into the same snag when I was working on summarizing text. Tweaking the prompt words sorted it out for me. I added a bit in the prompt to make sure the summary stays between 100 and 200 words.

This happened to me as well. After some research and test, I found setting the option num_kepp to 0 fixed this issue.

Possibly related to #2805 , #2225

Python code I used:

def send_summarization_request(text, system_prompt):
    url = summarization_endpoint
    payload = {
        "model": model,
        "prompt": text,
        "system": system_prompt,
        "stream": False,
        "keep_alive": "5m",
        "options": {
            "num_keep": 0,
            "num_batch": 8
        }
    }
    response = requests.post(url, json=payload)
    summary = json.loads(response.text)

    response_text = summary["response"]

    return response_text

@WithAnOrchid commented on GitHub (Apr 8, 2024): > I ran into the same snag when I was working on summarizing text. Tweaking the prompt words sorted it out for me. I added a bit in the prompt to make sure the summary stays between 100 and 200 words. This happened to me as well. After some research and test, I found setting the option `num_kepp` to `0` fixed this issue. Possibly related to #2805 , #2225 Python code I used: ``` def send_summarization_request(text, system_prompt): url = summarization_endpoint payload = { "model": model, "prompt": text, "system": system_prompt, "stream": False, "keep_alive": "5m", "options": { "num_keep": 0, "num_batch": 8 } } response = requests.post(url, json=payload) summary = json.loads(response.text) response_text = summary["response"] return response_text ```

GiteaMirror commented

2026-04-12 10:48:34 -05:00

@ckehagioglou commented on GitHub (Apr 9, 2024):

I ran into the same snag when I was working on summarizing text. Tweaking the prompt words sorted it out for me. I added a bit in the prompt to make sure the summary stays between 100 and 200 words.

This happened to me as well. After some research and test, I found setting the option num_kepp to 0 fixed this issue.

Possibly related to #2805 , #2225

Python code I used:
def send_summarization_request(text, system_prompt):
    url = summarization_endpoint
    payload = {
        "model": model,
        "prompt": text,
        "system": system_prompt,
        "stream": False,
        "keep_alive": "5m",
        "options": {
            "num_keep": 0,
            "num_batch": 8
        }
    }
    response = requests.post(url, json=payload)
    summary = json.loads(response.text)

    response_text = summary["response"]

    return response_text

Looked all over the place to find what num_keep does but with no avail. All I found is that num_keep default value is 0. Version 1.31 hangs even more often for me. Team ollama is doing a great work, but this bug is destroying the experience.

Working on a Mac Studio M2 Max 32GB, running many summarization tasks in sequence - if it helps.

@ckehagioglou commented on GitHub (Apr 9, 2024): > > I ran into the same snag when I was working on summarizing text. Tweaking the prompt words sorted it out for me. I added a bit in the prompt to make sure the summary stays between 100 and 200 words. > > This happened to me as well. After some research and test, I found setting the option `num_kepp` to `0` fixed this issue. > > Possibly related to #2805 , #2225 > > Python code I used: > > ``` > def send_summarization_request(text, system_prompt): > url = summarization_endpoint > payload = { > "model": model, > "prompt": text, > "system": system_prompt, > "stream": False, > "keep_alive": "5m", > "options": { > "num_keep": 0, > "num_batch": 8 > } > } > response = requests.post(url, json=payload) > summary = json.loads(response.text) > > response_text = summary["response"] > > return response_text > ``` Looked all over the place to find what num_keep does but with no avail. All I found is that num_keep default value is 0. Version 1.31 hangs even more often for me. Team ollama is doing a great work, but this bug is destroying the experience. Working on a Mac Studio M2 Max 32GB, running many summarization tasks in sequence - if it helps.

GiteaMirror commented

2026-04-12 10:48:34 -05:00

@traddo commented on GitHub (Apr 9, 2024):

I added the num_keep parameter but the bug still exists. For now, I'm using a timeout to kill the ollama process as a workaround to complete the batch summary tasks.
I use a bash script to start the 'ollama serve &' process, checking every 2 seconds to see if the ollama process exists, and if not, start it.
When calling the API, I add a timeout limit of 1 minute. If a timeout exception occurs, I kill the ollama process, then wait for 2 seconds, remove the task causing the timeout, and start the loop again.

@traddo commented on GitHub (Apr 9, 2024): I added the num_keep parameter but the bug still exists. For now, I'm using a timeout to kill the ollama process as a workaround to complete the batch summary tasks. I use a bash script to start the 'ollama serve &' process, checking every 2 seconds to see if the ollama process exists, and if not, start it. When calling the API, I add a timeout limit of 1 minute. If a timeout exception occurs, I kill the ollama process, then wait for 2 seconds, remove the task causing the timeout, and start the loop again.

GiteaMirror commented

2026-04-12 10:48:34 -05:00

@abhinav-kashyap-asus commented on GitHub (Apr 9, 2024):

I also have this bug... :( Unfortunately sometimes even restarting the ollama server is not helping it
It just hangs

@abhinav-kashyap-asus commented on GitHub (Apr 9, 2024): I also have this bug... :( Unfortunately sometimes even restarting the ollama server is not helping it It just hangs

GiteaMirror commented

2026-04-12 10:48:35 -05:00

@mrroll commented on GitHub (Apr 9, 2024):

I have the same experience. Adding the parameter does not prevent Ollama from getting stuck.

@mrroll commented on GitHub (Apr 9, 2024): I have the same experience. Adding the parameter does not prevent Ollama from getting stuck.

GiteaMirror commented

2026-04-12 10:48:35 -05:00

@danest commented on GitHub (Apr 9, 2024):

This happens to me too so I wrote a bash script that manages and just restarts it every 10 minutes....

start_ollama() {
    nohup ollama serve &  # Start ollama serve in the background and ignore hangups
    echo "Ollama serve started."
}

while true; do

    ollama_pid=$(pgrep -f 'ollama')

    if [ ! -z "$ollama_pid" ]; then
        echo "Killing ollama serve process: $ollama_pid"
        kill $ollama_pid
        sleep 2
    fi
    start_ollama

    sleep 600
done

@danest commented on GitHub (Apr 9, 2024): This happens to me too so I wrote a bash script that manages and just restarts it every 10 minutes.... ``` start_ollama() { nohup ollama serve & # Start ollama serve in the background and ignore hangups echo "Ollama serve started." } while true; do ollama_pid=$(pgrep -f 'ollama') if [ ! -z "$ollama_pid" ]; then echo "Killing ollama serve process: $ollama_pid" kill $ollama_pid sleep 2 fi start_ollama sleep 600 done ```

GiteaMirror commented

2026-04-12 10:48:35 -05:00

@jdonaldson commented on GitHub (Apr 9, 2024):

Hitting the stability issue here as well. Had to add a reset action in my
neovim so I could poke it awake more easily :
https://github.com/jdonaldson/dotfiles/blob/main/.config/lvim/config.lua#L45

On Tue, Apr 9, 2024 at 9:17 AM Kevin Urrutia @.***>
wrote:

This happens to me too so I wrote a bash script that manages and just
restarts it every 10 minutes....

start_ollama() {
nohup ollama serve & # Start ollama serve in the background and ignore hangups
echo "Ollama serve started."
}

while true; do
ollama_pid=$(pgrep -f 'ollama')

if [ ! -z "$ollama_pid" ]; then
    echo "Killing ollama serve process: $ollama_pid"
    kill $ollama_pid
    sleep 2
fi
start_ollama

sleep 600
done

—
Reply to this email directly, view it on GitHub
https://github.com/ollama/ollama/issues/1863#issuecomment-2045588515,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAKEB2XJ24THSMCWDGVFFLY4QICDAVCNFSM6AAAAABBSZT5G6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBVGU4DQNJRGU
.
You are receiving this because you are subscribed to this thread.Message
ID: @.***>

@jdonaldson commented on GitHub (Apr 9, 2024): Hitting the stability issue here as well. Had to add a reset action in my neovim so I could poke it awake more easily : https://github.com/jdonaldson/dotfiles/blob/main/.config/lvim/config.lua#L45 On Tue, Apr 9, 2024 at 9:17 AM Kevin Urrutia ***@***.***> wrote: > This happens to me too so I wrote a bash script that manages and just > restarts it every 10 minutes.... > > start_ollama() { > nohup ollama serve & # Start ollama serve in the background and ignore hangups > echo "Ollama serve started." > } > > while true; do > > ollama_pid=$(pgrep -f 'ollama') > > if [ ! -z "$ollama_pid" ]; then > echo "Killing ollama serve process: $ollama_pid" > kill $ollama_pid > sleep 2 > fi > start_ollama > > sleep 600 > done > > > — > Reply to this email directly, view it on GitHub > <https://github.com/ollama/ollama/issues/1863#issuecomment-2045588515>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AAAKEB2XJ24THSMCWDGVFFLY4QICDAVCNFSM6AAAAABBSZT5G6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBVGU4DQNJRGU> > . > You are receiving this because you are subscribed to this thread.Message > ID: ***@***.***> >

GiteaMirror commented

2026-04-12 10:48:36 -05:00

@dmitry-sablin-db commented on GitHub (Apr 12, 2024):

Got same issue , seem to be problems caused by using codelama 34b . But its not exactly , just iterative check

@dmitry-sablin-db commented on GitHub (Apr 12, 2024): Got same issue , seem to be problems caused by using codelama 34b . But its not exactly , just iterative check

GiteaMirror commented

2026-04-12 10:48:36 -05:00

@jtoy commented on GitHub (Apr 12, 2024):

Just to give more notes, I use ollama on mac and linux. On linux for me it seems moe stable. On a mac studio m1 and mac book pro m1, I have to restart it every dozen or so requests because it just freezes. I want to run this on my mac studio as a server, but its too unstable. I am going to just add a restart script every hour to see if that fixes it.

@jtoy commented on GitHub (Apr 12, 2024): Just to give more notes, I use ollama on mac and linux. On linux for me it seems moe stable. On a mac studio m1 and mac book pro m1, I have to restart it every dozen or so requests because it just freezes. I want to run this on my mac studio as a server, but its too unstable. I am going to just add a restart script every hour to see if that fixes it.

GiteaMirror commented

2026-04-12 10:48:36 -05:00

@javierrivarola commented on GitHub (Apr 15, 2024):

Same issue running on Macbook 16 M3 Max with 36 GB of ram, ollama hangs after an hour or so of usage, logs don't seem to indicate nothing wrong happened.. seems that i'll need to use a cronjob to restart it every hour

@javierrivarola commented on GitHub (Apr 15, 2024): Same issue running on Macbook 16 M3 Max with 36 GB of ram, ollama hangs after an hour or so of usage, logs don't seem to indicate nothing wrong happened.. seems that i'll need to use a cronjob to restart it every hour

GiteaMirror commented

2026-04-12 10:48:37 -05:00

@danomatika commented on GitHub (Apr 16, 2024):

We are seeing the same issue with Ubuntu 20.04 LTS and 2 x A100. So far I am taking a timeout check and restart approach by running the following script every 10 minutes with cron, ollama-check:

#! /bin/sh
# check if ollama api is not responding and restart service after timeout
# Dan Wilcox, ZKM | Hertzlab, zkm.de

# host
HOST=YOURSERVER:11434

# desired model
MODEL=llama2

# timeout in seconds
TIMEOUT=30

# make api call with timeout
curl --connect-timeout 5 --max-time $TIMEOUT -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"Why is the sky blue?\",
  \"format\": \"json\",
  \"stream\": false
}" http://${HOST}/api/generate 2> /dev/null

# force restart ollama on *any* non-zero exit code:
# https://everything.curl.dev/cmdline/exitcode.html
# ex. to check just timeout, use [ "$?" = "28" ]
if [ "$?" != "0" ] ; then
  killall ollama 2> /dev/null
  systemctl stop ollama 2> /dev/null
  systemctl start ollama
fi

This may not be the best solution, but we will try it for now.

@danomatika commented on GitHub (Apr 16, 2024): We are seeing the same issue with Ubuntu 20.04 LTS and 2 x A100. So far I am taking a timeout check and restart approach by running the following script every 10 minutes with cron, `ollama-check`: ```sh #! /bin/sh # check if ollama api is not responding and restart service after timeout # Dan Wilcox, ZKM | Hertzlab, zkm.de # host HOST=YOURSERVER:11434 # desired model MODEL=llama2 # timeout in seconds TIMEOUT=30 # make api call with timeout curl --connect-timeout 5 --max-time $TIMEOUT -d "{ \"model\": \"$MODEL\", \"prompt\": \"Why is the sky blue?\", \"format\": \"json\", \"stream\": false }" http://${HOST}/api/generate 2> /dev/null # force restart ollama on *any* non-zero exit code: # https://everything.curl.dev/cmdline/exitcode.html # ex. to check just timeout, use [ "$?" = "28" ] if [ "$?" != "0" ] ; then killall ollama 2> /dev/null systemctl stop ollama 2> /dev/null systemctl start ollama fi ``` This may not be the best solution, but we will try it for now.

GiteaMirror commented

2026-04-12 10:48:37 -05:00

@jossalgon commented on GitHub (Apr 16, 2024):

By modifying this with the latest version, I have not had any more problems. If anyone tries it too and it works for them I can do PR.

---
 llm/server.go | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/llm/server.go b/llm/server.go
index 0e084d5..f32e25c 100644
--- a/llm/server.go
+++ b/llm/server.go
@@ -503,6 +503,11 @@ type CompletionResponse struct {
 }
 
 func (s *LlamaServer) Completion(ctx context.Context, req CompletionRequest, fn func(CompletionResponse)) error {
+	// Set a timeout for the request
+	var cancelFunc context.CancelFunc
+      ctx, cancelFunc = context.WithTimeout(ctx, 60*time.Second)
+      defer cancelFunc()
+
 	request := map[string]any{
 		"prompt":            req.Prompt,
 		"stream":            true,
---

@jossalgon commented on GitHub (Apr 16, 2024): By modifying this with the latest version, I have not had any more problems. If anyone tries it too and it works for them I can do PR. ``` --- llm/server.go | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/llm/server.go b/llm/server.go index 0e084d5..f32e25c 100644 --- a/llm/server.go +++ b/llm/server.go @@ -503,6 +503,11 @@ type CompletionResponse struct { } func (s *LlamaServer) Completion(ctx context.Context, req CompletionRequest, fn func(CompletionResponse)) error { + // Set a timeout for the request + var cancelFunc context.CancelFunc + ctx, cancelFunc = context.WithTimeout(ctx, 60*time.Second) + defer cancelFunc() + request := map[string]any{ "prompt": req.Prompt, "stream": true, --- ```

GiteaMirror commented

2026-04-12 10:48:37 -05:00

@dhiltgen commented on GitHub (Apr 17, 2024):

Please give 0.1.32 a try and let us know if you're still seeing unrecoverable hangs.

@dhiltgen commented on GitHub (Apr 17, 2024): Please give 0.1.32 a try and let us know if you're still seeing unrecoverable hangs.

GiteaMirror commented

2026-04-12 10:48:38 -05:00

@airbj31 commented on GitHub (Apr 17, 2024):

Please give 0.1.32 a try and let us know if you're still seeing unrecoverable hangs.

I still have the same issue in both Linux computer (Ubuntu22.04 + GTX4090) and Macbook Pro (M3), but the tendency was reduced compared to the previous version (v0.1.30)

@airbj31 commented on GitHub (Apr 17, 2024): > Please give 0.1.32 a try and let us know if you're still seeing unrecoverable hangs. I still have the same issue in both Linux computer (Ubuntu22.04 + GTX4090) and Macbook Pro (M3), but the tendency was reduced compared to the previous version (v0.1.30)

GiteaMirror commented

2026-04-12 10:48:38 -05:00

@calebdel commented on GitHub (Apr 17, 2024):

0.1.32 seems to have fixed the issue for me. 2000+ iterations so far without a hang. Previously 5-10 iterations would cause Ollama to hang.

@calebdel commented on GitHub (Apr 17, 2024): 0.1.32 seems to have fixed the issue for me. 2000+ iterations so far without a hang. Previously 5-10 iterations would cause Ollama to hang.

GiteaMirror commented

2026-04-12 10:48:39 -05:00

@BruceMacD commented on GitHub (Apr 18, 2024):

Thanks to everyone for reporting and testing this. Marking this as resolved for now pending any more reports.

@BruceMacD commented on GitHub (Apr 18, 2024): Thanks to everyone for reporting and testing this. Marking this as resolved for now pending any more reports.

GiteaMirror commented

2026-04-12 10:48:39 -05:00

@kungfu-eric commented on GitHub (Apr 25, 2024):

Hangs after about 400 long context requests on mixtral and same with llama3

ollama --version
ollama version is 0.1.32

Please give 0.1.32 a try and let us know if you're still seeing unrecoverable hangs.

The hang continues to output this on the ollama server but no response is given to the client

{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056803}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056823}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056843}

@kungfu-eric commented on GitHub (Apr 25, 2024): Hangs after about 400 long context requests on mixtral and same with llama3 ``` ollama --version ollama version is 0.1.32 ``` > Please give 0.1.32 a try and let us know if you're still seeing unrecoverable hangs. The hang continues to output this on the ollama server but no response is given to the client ``` {"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056803} {"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056823} {"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056843} ```

GiteaMirror commented

2026-04-12 10:48:39 -05:00

@entmike commented on GitHub (Apr 25, 2024):

Still having the problem here on version 0.1.32. I am running batches of image annotations with llava and it will just hang after a few dozen images or so.

RTX 4090
Ubuntu 22.04
Running via Docker container

Restarting the container kicks it back into submission but looking for a a more reliable answer.

@entmike commented on GitHub (Apr 25, 2024): Still having the problem here on version `0.1.32`. I am running batches of image annotations with llava and it will just hang after a few dozen images or so. RTX 4090 Ubuntu 22.04 Running via Docker container Restarting the container kicks it back into submission but looking for a a more reliable answer.

GiteaMirror commented

2026-04-12 10:48:40 -05:00

@kirill-vas commented on GitHub (Apr 26, 2024):

Also still experiencing hangs when calling /api/chat endpoint. Running on HumanEval benchmark (164 samples), usually fails about 70-80 calls. Requires ollama serve restart to recover. Mostly happens with CodeLlama-70b rather than the smaller models (13b, 7b; only tested these).

Running v0.1.32 on Ubuntu 22.04.2 with NVIDIA RTX A6000, driver 530.30.02, CUDA 12.1, using a Docker container

The specific part where it seems to loop indefinitely is the update_slots function
with "msg":"slot context shift" line from ollama serve logs (full logs below):

{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984503}

Code that calls the endpoint:

    url = "http://localhost:11434/api/chat"
    params = {
        'model': 'codellama:70b-instruct',
        'options': {
            'num_ctx': 4096
        },
        'messages': [
            {'role': 'system', 'content': sys_prompt},
            {"role": "user", "content": user_prompt},
        ],
        'options': {
            'seed': 123,
            'temperature': 0.2,
        },
        'stream': False,
        'keep_alive': 10 
    }
    response = requests.post(url, json=params).json()

Full log of the run below:

time=2024-04-24T14:30:43.820-04:00 level=INFO source=gpu.go:121 msg="Detecting GPU type"
time=2024-04-24T14:30:43.820-04:00 level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-24T14:30:43.824-04:00 level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1496169119/runners/cuda_v11/libcudart.so.11.0]"
time=2024-04-24T14:30:44.105-04:00 level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
time=2024-04-24T14:30:44.105-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-24T14:30:44.387-04:00 level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
time=2024-04-24T14:30:44.437-04:00 level=INFO source=gpu.go:121 msg="Detecting GPU type"
time=2024-04-24T14:30:44.437-04:00 level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-24T14:30:44.440-04:00 level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1496169119/runners/cuda_v11/libcudart.so.11.0]"
time=2024-04-24T14:30:44.440-04:00 level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
time=2024-04-24T14:30:44.440-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-24T14:30:44.505-04:00 level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:127 msg="offload to gpu" reallayers=81 layers=81 required="38351.2 MiB" used="38351.2 MiB" available="45354.3 MiB" kv="640.0 MiB" fulloffload="324.0 MiB" partialoffload="348.0 MiB"
time=2024-04-24T14:30:44.536-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1496169119/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-1436d66b69757a245f02d000874c670507949d11ad5c188a623652052c6aa508 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --port 38491"
time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"139817808982016","timestamp":1713983444}
{"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"139817808982016","timestamp":1713983444}
{"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":32,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"139817808982016","timestamp":1713983444,"total_threads":64}
llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from /root/.ollama/models/blobs/sha256-1436d66b69757a245f02d000874c670507949d11ad5c188a623652052c6aa508 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 80
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_0:  561 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 260/32016 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 36.20 GiB (4.51 BPW)
llm_load_print_meta: general.name     = codellama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.55 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:        CPU buffer size =   140.70 MiB
llm_load_tensors:      CUDA0 buffer size = 36930.21 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   640.00 MiB
llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.15 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   324.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    20.01 MiB
llama_new_context_with_model: graph nodes  = 2566
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"139817808982016","timestamp":1713983453}
{"function":"initialize","level":"INFO","line":457,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"139817808982016","timestamp":1713983453}
{"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"139817808982016","timestamp":1713983453}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"63","port":"38491","tid":"139817808982016","timestamp":1713983453}
{"function":"update_slots","level":"INFO","line":1578,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"139817808982016","timestamp":1713983453}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7726,"tid":"139817808982016","timestamp":1713984223}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7727,"tid":"139817808982016","timestamp":1713984223}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7728,"tid":"139817808982016","timestamp":1713984223}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1809,"msg":"slot progression","n_past":4,"n_past_se":0,"n_prompt_tokens_processed":300,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223}
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":4,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984352}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984428}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984503}

@kirill-vas commented on GitHub (Apr 26, 2024): Also still experiencing hangs when calling `/api/chat` endpoint. Running on HumanEval benchmark (164 samples), usually fails about 70-80 calls. Requires `ollama serve` restart to recover. Mostly happens with CodeLlama-70b rather than the smaller models (13b, 7b; only tested these). Running v0.1.32 on Ubuntu 22.04.2 with NVIDIA RTX A6000, driver 530.30.02, CUDA 12.1, using a Docker container The specific part where it seems to loop indefinitely is the `update_slots` function with `"msg":"slot context shift"` line from `ollama serve` logs (full logs below): ``` {"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984503} ``` Code that calls the endpoint: ``` url = "http://localhost:11434/api/chat" params = { 'model': 'codellama:70b-instruct', 'options': { 'num_ctx': 4096 }, 'messages': [ {'role': 'system', 'content': sys_prompt}, {"role": "user", "content": user_prompt}, ], 'options': { 'seed': 123, 'temperature': 0.2, }, 'stream': False, 'keep_alive': 10 } response = requests.post(url, json=params).json() ``` Full log of the run below: ``` time=2024-04-24T14:30:43.820-04:00 level=INFO source=gpu.go:121 msg="Detecting GPU type" time=2024-04-24T14:30:43.820-04:00 level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*" time=2024-04-24T14:30:43.824-04:00 level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1496169119/runners/cuda_v11/libcudart.so.11.0]" time=2024-04-24T14:30:44.105-04:00 level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart" time=2024-04-24T14:30:44.105-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-24T14:30:44.387-04:00 level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6" time=2024-04-24T14:30:44.437-04:00 level=INFO source=gpu.go:121 msg="Detecting GPU type" time=2024-04-24T14:30:44.437-04:00 level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*" time=2024-04-24T14:30:44.440-04:00 level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1496169119/runners/cuda_v11/libcudart.so.11.0]" time=2024-04-24T14:30:44.440-04:00 level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart" time=2024-04-24T14:30:44.440-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-24T14:30:44.505-04:00 level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6" time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:127 msg="offload to gpu" reallayers=81 layers=81 required="38351.2 MiB" used="38351.2 MiB" available="45354.3 MiB" kv="640.0 MiB" fulloffload="324.0 MiB" partialoffload="348.0 MiB" time=2024-04-24T14:30:44.536-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1496169119/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-1436d66b69757a245f02d000874c670507949d11ad5c188a623652052c6aa508 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --port 38491" time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:389 msg="waiting for llama runner to start responding" {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"139817808982016","timestamp":1713983444} {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"139817808982016","timestamp":1713983444} {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":32,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"139817808982016","timestamp":1713983444,"total_threads":64} llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from /root/.ollama/models/blobs/sha256-1436d66b69757a245f02d000874c670507949d11ad5c188a623652052c6aa508 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = codellama llama_model_loader: - kv 2: llama.context_length u32 = 2048 llama_model_loader: - kv 3: llama.embedding_length u32 = 8192 llama_model_loader: - kv 4: llama.block_count u32 = 80 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 64 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 11: general.file_type u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["<unk>", "<s>", "</s>", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 161 tensors llama_model_loader: - type q4_0: 561 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 260/32016 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32016 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 2048 llm_load_print_meta: n_embd = 8192 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 80 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 28672 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 2048 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 70B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 68.98 B llm_load_print_meta: model size = 36.20 GiB (4.51 BPW) llm_load_print_meta: general.name = codellama llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 2 '</s>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: LF token = 13 '<0x0A>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes ggml_cuda_init: CUDA_USE_TENSOR_CORES: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.55 MiB llm_load_tensors: offloading 80 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 81/81 layers to GPU llm_load_tensors: CPU buffer size = 140.70 MiB llm_load_tensors: CUDA0 buffer size = 36930.21 MiB .................................................................................................... llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 640.00 MiB llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.15 MiB llama_new_context_with_model: CUDA0 compute buffer size = 324.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB llama_new_context_with_model: graph nodes = 2566 llama_new_context_with_model: graph splits = 2 {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"139817808982016","timestamp":1713983453} {"function":"initialize","level":"INFO","line":457,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"139817808982016","timestamp":1713983453} {"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"139817808982016","timestamp":1713983453} {"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"63","port":"38491","tid":"139817808982016","timestamp":1713983453} {"function":"update_slots","level":"INFO","line":1578,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"139817808982016","timestamp":1713983453} {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7726,"tid":"139817808982016","timestamp":1713984223} {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223} {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7727,"tid":"139817808982016","timestamp":1713984223} {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223} {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223} {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7728,"tid":"139817808982016","timestamp":1713984223} {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223} {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223} {"function":"update_slots","ga_i":0,"level":"INFO","line":1809,"msg":"slot progression","n_past":4,"n_past_se":0,"n_prompt_tokens_processed":300,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223} {"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":4,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223} {"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984352} {"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984428} {"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984503} ```

GiteaMirror commented

2026-04-12 10:48:40 -05:00

@omani commented on GitHub (Apr 26, 2024):

I dont understand the hurry to close this issue without getting enough feedback first. where have you learned this @BruceMacD ? or is this normal procedure in your dev workflow?

@omani commented on GitHub (Apr 26, 2024): I dont understand the hurry to close this issue without getting enough feedback first. where have you learned this @BruceMacD ? or is this normal procedure in your dev workflow?

GiteaMirror commented

2026-04-12 10:48:40 -05:00

@EmanueleLenzi92 commented on GitHub (Apr 27, 2024):

I think I have the same problem. After a few runs, the ollama server crashes and stops to generate text. I'm using windows 11 (wsl ubuntu) and langchain. I have a rtx 4090 and I tried from 0.1.16 to 0.1.19, but all of them have this issue in my case. instead, on a laptop with windows 10 and with an nvidia T500, I don't have this problem.

I still have this problem with 0.1.32 version with rtx 4090 and windows 11 (on wsl ubuntu).
After a few run (less then 10), the Ollama server is stuck and i can't access anymore to "localhost:11434/" unless i kill the process

@EmanueleLenzi92 commented on GitHub (Apr 27, 2024): > I think I have the same problem. After a few runs, the ollama server crashes and stops to generate text. I'm using windows 11 (wsl ubuntu) and langchain. I have a rtx 4090 and I tried from 0.1.16 to 0.1.19, but all of them have this issue in my case. instead, on a laptop with windows 10 and with an nvidia T500, I don't have this problem. I still have this problem with 0.1.32 version with rtx 4090 and windows 11 (on wsl ubuntu). After a few run (less then 10), the Ollama server is stuck and i can't access anymore to "localhost:11434/" unless i kill the process

GiteaMirror commented

2026-04-12 10:48:41 -05:00

@frederick-wang commented on GitHub (Apr 27, 2024):

Got the same bug with A100 on Ubuntu 22.04. ollama version is 0.1.32.

@frederick-wang commented on GitHub (Apr 27, 2024): Got the same bug with A100 on Ubuntu 22.04. ollama version is 0.1.32.

GiteaMirror commented

2026-04-12 10:48:41 -05:00

@frederick-wang commented on GitHub (Apr 27, 2024):

Thanks to everyone for reporting and testing this. Marking this as resolved for now pending any more reports.

‌‌‌Sorry bro @BruceMacD, I found that this issue has not been resolved. I encountered the same stuck issue yesterday (ollama 0.1.32, A100, Ubuntu 22.04) and had to restart to resolve it.

@frederick-wang commented on GitHub (Apr 27, 2024): > Thanks to everyone for reporting and testing this. Marking this as resolved for now pending any more reports. ‌‌‌Sorry bro @BruceMacD, I found that this issue has not been resolved. I encountered the same stuck issue yesterday (ollama 0.1.32, A100, Ubuntu 22.04) and had to restart to resolve it.

GiteaMirror commented

@ckehagioglou commented on GitHub (Apr 28, 2024):

Same bug here. Mac M2 Max Studio hangs after several questions being asked.

@ckehagioglou commented on GitHub (Apr 28, 2024): Same bug here. Mac M2 Max Studio hangs after several questions being asked.

GiteaMirror commented

@BruceMacD commented on GitHub (Apr 28, 2024):

Thanks for the reports, re-opening this.

Couple of questions to help me reproduce:

What models are people seeing this on?
Are you inputting long prompts/context when it gets stuck?

@BruceMacD commented on GitHub (Apr 28, 2024): Thanks for the reports, re-opening this. Couple of questions to help me reproduce: - What models are people seeing this on? - Are you inputting long prompts/context when it gets stuck?

GiteaMirror commented

@airbj31 commented on GitHub (Apr 28, 2024):

Thanks for the reports, re-opening this.

Couple of questions to help me reproduce:

What models are people seeing this on?

Are you inputting long prompts/context when it gets stuck?

mistral, mistral:instruct, llama3
I usually use the LLM model to summarize text. The inputs are normally less than 50 ~500 words.

@airbj31 commented on GitHub (Apr 28, 2024): > Thanks for the reports, re-opening this. > > Couple of questions to help me reproduce: > > * What models are people seeing this on? > * Are you inputting long prompts/context when it gets stuck? 1. mistral, mistral:instruct, llama3 2. I usually use the LLM model to summarize text. The inputs are normally less than 50 ~500 words.

GiteaMirror commented

2026-04-12 10:48:43 -05:00

@EmanueleLenzi92 commented on GitHub (Apr 28, 2024):

Thanks for the reports, re-opening this.

Couple of questions to help me reproduce:

What models are people seeing this on?

Are you inputting long prompts/context when it gets stuck?

llama2 7b and 13b, llama3 8b
I use about 300 words in the prompts

@EmanueleLenzi92 commented on GitHub (Apr 28, 2024): > Thanks for the reports, re-opening this. > > Couple of questions to help me reproduce: > > * What models are people seeing this on? > * Are you inputting long prompts/context when it gets stuck? 1. llama2 7b and 13b, llama3 8b 2. I use about 300 words in the prompts

GiteaMirror commented

@dhiltgen commented on GitHub (Apr 28, 2024):

The pre-release for 0.1.33 is available now, which should resolve these long context hang/loop problems.

@dhiltgen commented on GitHub (Apr 28, 2024): The pre-release for [0.1.33](https://github.com/ollama/ollama/releases) is available now, which should resolve these long context hang/loop problems.

GiteaMirror commented

2026-04-12 10:48:43 -05:00

@syrom commented on GitHub (Apr 28, 2024):

@dhiltgen Great news, thank you: will try asap after I have the update installed.
FYI, the situation has alread improved considerably - but hangups still there are with 0.1.32.
I experienced a hangup after having Ollama / Mixtral churn thru a large text file for > 12 h, extracting semanting information from it.
Setup: M1 Powerbook with 64 GB RAM and Ollama 0.1.32.
The text had 623 chunks with 1000 characters each (plus another ca. 400 characters prompt size) - and the hangup occured after processing 517 of these chunks.

@syrom commented on GitHub (Apr 28, 2024): @dhiltgen Great news, thank you: will try asap after I have the update installed. FYI, the situation has alread improved considerably - but hangups still there are with 0.1.32. I experienced a hangup after having Ollama / Mixtral churn thru a large text file for > 12 h, extracting semanting information from it. Setup: M1 Powerbook with 64 GB RAM and Ollama 0.1.32. The text had 623 chunks with 1000 characters each (plus another ca. 400 characters prompt size) - and the hangup occured after processing 517 of these chunks.

GiteaMirror commented

2026-04-12 10:48:43 -05:00

@WeirdCarrotMonster commented on GitHub (Apr 28, 2024):

I can still encounter this problem on 0.1.33: ollama gets stuck after 15 minutes of embeddings processing (using nomic-embed-text). Last log lines:

ollama[8937]: [GIN] 2024/04/28 - 20:12:17 | 200 |  155.784253ms |  100.112.67.113 | POST     "/api/embeddings"
ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31361,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41614,"status":200,"tid":"140301227372544","timestamp":1714335137}
ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31362,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41614,"status":200,"tid":"140301227372544","timestamp":1714335137}
ollama[9198]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"update_slots","level":"INFO","line":1640,"msg":"slot released","n_cache_tokens":103,"n_ctx":2048,"n_past":103,"n_system_tokens":0,"slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137,"truncated":false}
ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/embedding","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137}
ollama[8937]: [GIN] 2024/04/28 - 20:12:17 | 200 |   49.744207ms |  100.112.67.113 | POST     "/api/embeddings"
ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31366,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137}
ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31367,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137}
ollama[9198]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":31368,"tid":"140302648860672","timestamp":1714335137}

GPU: NVIDIA GeForce RTX 3060
Driver version: 550.54.14
CUDA Version: 12.4

@WeirdCarrotMonster commented on GitHub (Apr 28, 2024): I can still encounter this problem on 0.1.33: ollama gets stuck after 15 minutes of embeddings processing (using nomic-embed-text). Last log lines: ``` ollama[8937]: [GIN] 2024/04/28 - 20:12:17 | 200 | 155.784253ms | 100.112.67.113 | POST "/api/embeddings" ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31361,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41614,"status":200,"tid":"140301227372544","timestamp":1714335137} ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31362,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41614,"status":200,"tid":"140301227372544","timestamp":1714335137} ollama[9198]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"update_slots","level":"INFO","line":1640,"msg":"slot released","n_cache_tokens":103,"n_ctx":2048,"n_past":103,"n_system_tokens":0,"slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137,"truncated":false} ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/embedding","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137} ollama[8937]: [GIN] 2024/04/28 - 20:12:17 | 200 | 49.744207ms | 100.112.67.113 | POST "/api/embeddings" ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31366,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137} ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31367,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137} ollama[9198]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":31368,"tid":"140302648860672","timestamp":1714335137} ``` GPU: NVIDIA GeForce RTX 3060 Driver version: 550.54.14 CUDA Version: 12.4

GiteaMirror commented

2026-04-12 10:48:44 -05:00

@janis-inzpire commented on GitHub (May 3, 2024):

Just to add to this a bit - looks like we are experiencing same issue.
Running llava model, gets stuck every 15 - 20 minutes. Sometimes gets stuck after just 4 requests.
We are using API to call the endpoint. running version 0.1.33 though a docker container.

@janis-inzpire commented on GitHub (May 3, 2024): Just to add to this a bit - looks like we are experiencing same issue. Running llava model, gets stuck every 15 - 20 minutes. Sometimes gets stuck after just 4 requests. We are using API to call the endpoint. running version 0.1.33 though a docker container.

GiteaMirror commented

2026-04-12 10:48:44 -05:00

@ukrolelo commented on GitHub (May 5, 2024):

+1 stuck on question in different language.

You 05/05/2024 18:18
то ти розуміеш украйську мову?



profile
Llama3:Latest 
Корисно запитання!

Я можу зрозуміти основні фрази і слова на українській мові, оскільки я вивчав великий корпус текстів різних мов, включно з українською.


Однак, моя комп'ютерна профікувність українській мові може бути обмеженою порівняно з профікувністю кореспондента-українця. Я можу зробити помилки в граматиці, морф

remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939}
{"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":832,"tid":"139900887961600","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939}
{"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":833,"tid":"139900887961600","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939}
{"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":834,"tid":"139900887961600","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939}
{"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":835,"tid":"139900887961600","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1812,"msg":"slot progression","n_past":1085,"n_past_se":0,"n_prompt_tokens_processed":307,"slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939}
{"function":"update_slots","level":"INFO","line":1839,"msg":"kv cache rm [p0, end)","p0":1085,"slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939}

@ukrolelo commented on GitHub (May 5, 2024): +1 stuck on question in different language. ``` You 05/05/2024 18:18 то ти розуміеш украйську мову? profile Llama3:Latest Корисно запитання! Я можу зрозуміти основні фрази і слова на українській мові, оскільки я вивчав великий корпус текстів різних мов, включно з українською. Однак, моя комп'ютерна профікувність українській мові може бути обмеженою порівняно з профікувністю кореспондента-українця. Я можу зробити помилки в граматиці, морф ``` ``` remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939} {"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":832,"tid":"139900887961600","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939} {"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":833,"tid":"139900887961600","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939} {"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":834,"tid":"139900887961600","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939} {"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":835,"tid":"139900887961600","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939} {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939} {"function":"update_slots","ga_i":0,"level":"INFO","line":1812,"msg":"slot progression","n_past":1085,"n_past_se":0,"n_prompt_tokens_processed":307,"slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939} {"function":"update_slots","level":"INFO","line":1839,"msg":"kv cache rm [p0, end)","p0":1085,"slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939} ```

GiteaMirror commented

2026-04-12 10:48:44 -05:00

@syrom commented on GitHub (May 6, 2024):

A quick feedback: from my perspective, the bug is solved as far as Ollama running on Mac Silicon is concerned. I was never able to to process more than ca. 120 text chunks of a size of 1.000 characters in one go on an M1 Pro Mac. Now, with the upate to 0.1.33, the computer ran for 24 h nonstop, processing ca. 630 text chunks of a larger document to extract information from it.... and did so to the very end.
Simply: Thanks !

@syrom commented on GitHub (May 6, 2024): A quick feedback: from my perspective, the bug is solved as far as Ollama running on Mac Silicon is concerned. I was never able to to process more than ca. 120 text chunks of a size of 1.000 characters in one go on an M1 Pro Mac. Now, with the upate to 0.1.33, the computer ran for 24 h nonstop, processing ca. 630 text chunks of a larger document to extract information from it.... and did so to the very end. Simply: Thanks !

GiteaMirror commented

2026-04-12 10:48:45 -05:00

@maciejmajek commented on GitHub (May 9, 2024):

Still happens to me with llava models @ ollama v0.1.34
Interestingly, Ollama only freezes up when I use the /chat endpoint with both image and text data. It works fine when only text is sent.
I've noticed that the problem gets worse when I hit the /chat endpoint with multiple prompts at once using Ollama's queuing system. It tends to hang after about 30 seconds...

Setup:
2x RTX 4090
13900k

logs:

Last succesful chat call

[GIN] 2024/05/09 - 18:36:57 | 200 | 8.140684188s | 10.244.163.252 | POST "/api/chat"
time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:294 msg="context for request finished"
time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:232 msg="runner with non-zero duration has gone idle, adding timer" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a duration=5m0s
time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:248 msg="after processing request finished event" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a refCount=0
time=2024-05-09T18:36:59.457+02:00 level=DEBUG source=sched.go:435 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":850,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":851,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":852,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
time=2024-05-09T18:36:59.594+02:00 level=DEBUG source=prompt.go:172 msg="prompt now fits in context window" required=1988 window=2048
time=2024-05-09T18:36:59.595+02:00 level=DEBUG source=routes.go:1241 msg="chat handler" prompt="<|im_start|>system\n<|im_end|>\n<|im_start|>user\n- blah blah <|im_start|>system\n<|im_end|>\n<|im_start|>user\n[img-0] [img-1] input: two consecutive images blah blah <|im_end|>\n<|im_start|>assistant\n" images=2
time=2024-05-09T18:36:59.595+02:00 level=DEBUG source=server.go:591 msg="setting token limit to 10x num_ctx" num_ctx=2048 num_predict=20480
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":853,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49272,"status":200,"tid":"140038952120320","timestamp":1715272619}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":854,"tid":"140043376336896","timestamp":1715272619}
{"function":"update_slots","level":"INFO","line":1837,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":854,"tid":"140043376336896","timestamp":1715272619}

@maciejmajek commented on GitHub (May 9, 2024): Still happens to me with llava models @ ollama v0.1.34 Interestingly, Ollama only freezes up when I use the /chat endpoint with both image and text data. It works fine when only text is sent. I've noticed that the problem gets worse when I hit the /chat endpoint with multiple prompts at once using Ollama's queuing system. It tends to hang after about 30 seconds... Setup: 2x RTX 4090 13900k logs: <details><summary>Last succesful chat call </summary> <p> [GIN] 2024/05/09 - 18:36:57 | 200 | 8.140684188s | 10.244.163.252 | POST "/api/chat" time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:294 msg="context for request finished" time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:232 msg="runner with non-zero duration has gone idle, adding timer" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a duration=5m0s time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:248 msg="after processing request finished event" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a refCount=0 time=2024-05-09T18:36:59.457+02:00 level=DEBUG source=sched.go:435 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a {"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":850,"tid":"140043376336896","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619} {"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":851,"tid":"140043376336896","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619} {"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":852,"tid":"140043376336896","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619} time=2024-05-09T18:36:59.594+02:00 level=DEBUG source=prompt.go:172 msg="prompt now fits in context window" required=1988 window=2048 time=2024-05-09T18:36:59.595+02:00 level=DEBUG source=routes.go:1241 msg="chat handler" prompt="<|im_start|>system\n<|im_end|>\n<|im_start|>user\n- blah blah <|im_start|>system\n<|im_end|>\n<|im_start|>user\n[img-0] [img-1] input: two consecutive images blah blah <|im_end|>\n<|im_start|>assistant\n" images=2 time=2024-05-09T18:36:59.595+02:00 level=DEBUG source=server.go:591 msg="setting token limit to 10x num_ctx" num_ctx=2048 num_predict=20480 {"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":853,"tid":"140043376336896","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49272,"status":200,"tid":"140038952120320","timestamp":1715272619} {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":854,"tid":"140043376336896","timestamp":1715272619} {"function":"update_slots","level":"INFO","line":1837,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":854,"tid":"140043376336896","timestamp":1715272619} </p> </details>

GiteaMirror commented

2026-04-12 10:48:45 -05:00

@mironnn commented on GitHub (May 16, 2024):

The same. still, have issues
ollama version 0.1.38
RTXA6000
Llama3:70b
Hangs on <10 requests

@mironnn commented on GitHub (May 16, 2024): The same. still, have issues ollama version 0.1.38 RTXA6000 Llama3:70b Hangs on <10 requests

GiteaMirror commented

2026-04-12 10:48:46 -05:00

@quaintdev commented on GitHub (May 23, 2024):

Has happened with multiple models for me. My prompt is usually just a line. I have seen that it happens if I kept it idle for sometime. When I come back the responses are stuck. I am running on CPU. The first prompt after starting ollama serve command always gets quick response. I am on 0.1.37

Edit: Happens on 0.1.38 too. I always see something like below when this happens

With easyllama and I don't see such issue.

@quaintdev commented on GitHub (May 23, 2024): Has happened with multiple models for me. My prompt is usually just a line. I have seen that it happens if I kept it idle for sometime. When I come back the responses are stuck. I am running on CPU. The first prompt after starting `ollama serve` command always gets quick response. I am on `0.1.37` Edit: Happens on `0.1.38` too. I always see something like below when this happens ![image](https://github.com/ollama/ollama/assets/59229571/2505451e-2b71-4326-b3a6-9e83b0470a8b) With easyllama and I don't see such issue.

GiteaMirror commented

2026-04-12 10:48:46 -05:00

@sammcj commented on GitHub (May 26, 2024):

Hi all, give this fix a go: https://github.com/ollama/ollama/issues/4604#issuecomment-2130436000

@sammcj commented on GitHub (May 26, 2024): Hi all, give this fix a go: https://github.com/ollama/ollama/issues/4604#issuecomment-2130436000

GiteaMirror commented

2026-04-12 10:48:46 -05:00

@quaintdev commented on GitHub (May 26, 2024):

I'm not using Docker so I don't think this fix is applicable to me.

@quaintdev commented on GitHub (May 26, 2024): I'm not using Docker so I don't think this fix is applicable to me.

GiteaMirror commented

2026-04-12 10:48:47 -05:00

@jak4 commented on GitHub (May 27, 2024):

I'm experiencing a similar issue. I'm running on a virtualized VM with a Tesla P40. After booting the VM everything works, but after a while, when the server idles, it stops working. Neither requests from a frontend nor from the cli (e.g. ollama run llama3) work. With the cli it just never starts up. The log files dont show anything suspicous. "service ollama restart" does nothing. The only thing that is maybe not aligned is, that I'm having CUDA 12.2 installed but the runner is using v11.
Edit: Version 0.1.38

Edit: Also happens on Version 0.1.39. What is maybe interessting is, that this happens regardless of running queries against the LLM or not. After booting the VM and not running any query for an unspecified amount of time, but less than 2 h, ollama becomes/is unresponsive. It seems the model gets loaded, but doesnt finish. After loading the model up to a certain point, with LLama 3 to around 4800 MiB of GPU RAM, the loading slows to a crawl and the GPU RAM usage increases at 2 MiB for every few (two seconds?). At some point it increases with 6 MiB per every few seconds (at 4922 MiB), and then stops completly (at 4934 MiB). After a while the process stops and the GPU RAM is completly empty again.

When comparing a working ollama instance to a non-responsive instance the load speed for the model is way higher when everything works out. The model I used for this testing uses 4934 MiB when fully loaded. Which tracks with the number above.

@jak4 commented on GitHub (May 27, 2024): I'm experiencing a similar issue. I'm running on a virtualized VM with a Tesla P40. After booting the VM everything works, but after a while, when the server idles, it stops working. Neither requests from a frontend nor from the cli (e.g. ollama run llama3) work. With the cli it just never starts up. The log files dont show anything suspicous. "service ollama restart" does nothing. The only thing that is maybe not aligned is, that I'm having CUDA 12.2 installed but the runner is using v11. Edit: Version 0.1.38 Edit: Also happens on Version 0.1.39. What is maybe interessting is, that this happens regardless of running queries against the LLM or not. After booting the VM and not running any query for an unspecified amount of time, but less than 2 h, ollama becomes/is unresponsive. It seems the model gets loaded, but doesnt finish. After loading the model up to a certain point, with LLama 3 to around 4800 MiB of GPU RAM, the loading slows to a crawl and the GPU RAM usage increases at 2 MiB for every few (two seconds?). At some point it increases with 6 MiB per every few seconds (at 4922 MiB), and then stops completly (at 4934 MiB). After a while the process stops and the GPU RAM is completly empty again. When comparing a working ollama instance to a non-responsive instance the load speed for the model is way higher when everything works out. The model I used for this testing uses 4934 MiB when fully loaded. Which tracks with the number above.

GiteaMirror commented

2026-04-12 10:48:47 -05:00

@blubbsy commented on GitHub (May 28, 2024):

Thanks for the reports, re-opening this.

Couple of questions to help me reproduce:
* What models are people seeing this on?

* Are you inputting long prompts/context when it gets stuck?

i'm seeing the same problem on Windows. I'm using llava:v1.6 and pass the images through bind(...) as base64 and then invoke the promp. works fine for few prompts and then stops.
i'm at the moment checking if maybe something else could be wrong, but as my experience fits to what i read here i want to mention it.

@blubbsy commented on GitHub (May 28, 2024): > Thanks for the reports, re-opening this. > > Couple of questions to help me reproduce: > > * What models are people seeing this on? > > * Are you inputting long prompts/context when it gets stuck? i'm seeing the same problem on Windows. I'm using llava:v1.6 and pass the images through bind(...) as base64 and then invoke the promp. works fine for few prompts and then stops. i'm at the moment checking if maybe something else could be wrong, but as my experience fits to what i read here i want to mention it.

GiteaMirror commented

@jak4 commented on GitHub (May 31, 2024):

I'm seeing the same issues with vLLM which indicates a problem with some underlying libraries, e.g. torch or maybe something CUDA. What is fascinating is, that this has apparently nothing to do with time between requests or even "going into sleep mode" since I managed to perform a query to vLLM which was working perfectly with around 15 tokens per second and then slowed to a crawl with 0.1 tokens per second. So even while the generation is running this stuff happens.

@jak4 commented on GitHub (May 31, 2024): I'm seeing the same issues with vLLM which indicates a problem with some underlying libraries, e.g. torch or maybe something CUDA. What is fascinating is, that this has apparently nothing to do with time between requests or even "going into sleep mode" since I managed to perform a query to vLLM which was working perfectly with around 15 tokens per second and then slowed to a crawl with 0.1 tokens per second. So even while the generation is running this stuff happens.

GiteaMirror commented

@jak4 commented on GitHub (Jun 1, 2024):

I have resolved my issue. It had nothing todo with ollama, vllm, or any other part of the software stack. It was a LICENSING issue. I simply forgot to aquire a license for the vGPU the VM was using. So after a while the nvidia driver degraded the performance of the vGPU to become basically unusable.

@jak4 commented on GitHub (Jun 1, 2024): I have resolved my issue. It had nothing todo with ollama, vllm, or any other part of the software stack. It was a LICENSING issue. I simply forgot to aquire a license for the vGPU the VM was using. So after a while the nvidia driver degraded the performance of the vGPU to become basically unusable.

GiteaMirror commented

@mchiang0610 commented on GitHub (Jun 1, 2024):

@jak4 thank you for letting us know about this. May I ask what was the VM provider so we know in the future what to lookout for?

@mchiang0610 commented on GitHub (Jun 1, 2024): @jak4 thank you for letting us know about this. May I ask what was the VM provider so we know in the future what to lookout for?

GiteaMirror commented