[GH-ISSUE #1863] Ollama stuck after few runs #47576

Closed
opened 2026-04-28 04:16:31 -05:00 by GiteaMirror · 123 comments
Owner

Originally created by @jadhvank on GitHub (Jan 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1863

Originally assigned to: @jessegross on GitHub.

I updated Ollama from 0.1.16 to 0.1.18 and encountered the issue.
I am using Python to run LLM models with Ollama and Langchain on a Linux server (4 x A100 GPUs).
There are 5,000 prompts to send and get results from the LLM.
With Ollama 0.1.17, the Ollama server stopped after 1 or 2 days.
Now it hangs within 10 minutes.

![image](https://github.com/jmorganca/ollama/assets/11309219/622a494a-0378-4ca8-bb60-c8526626ae66)

This is the Ollama server message when it stops running.
It happens more often when Phi 2 runs than when Mixtral runs.
After the freeze, if I exit the server and run it again, the prompt is sent and the LLM answer is received successfully.

The environment
Linux: Ubuntu 22.04.3 LTS
python: 3.10.12
Ollama: 0.1.18
Langchain: 0.0.274
Mixtral: latest
Phi 2: latest
GPU: NVIDIA A100-SXM4-80GB x 4
Prompt size: ~10K
# of Prompts: 5K
![image](https://github.com/jmorganca/ollama/assets/11309219/92f0cd9d-59b1-4e66-bc76-fc71a1914fee)

I read these issues: https://github.com/jmorganca/ollama/issues/1853 and https://github.com/jmorganca/ollama/issues/1688,
but none of them worked here.

Also, if there is any way to install a previous version of Ollama (0.1.16), let me know.

GiteaMirror added the performance and bug labels 2026-04-28 04:16:31 -05:00

@Mahmuod1 commented on GitHub (Jan 9, 2024):

@jadhvank
for a previous version you can install the docker [ollama/hub](https://hub.docker.com/r/ollama/ollama)
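
For reference, a minimal sketch of pinning a release via a Docker Hub tag (assuming the `0.1.16` tag is published there; `--gpus=all` needs the NVIDIA Container Toolkit):

```
# Pull and run a pinned Ollama release (tag assumed to exist on Docker Hub)
docker pull ollama/ollama:0.1.16
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:0.1.16
```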

@jmorganca commented on GitHub (Jan 9, 2024):

Hi @jadhvank sorry you hit this, looking into it

In the meantime an easy way to install 0.1.17 is

```
curl https://ollama.ai/install.sh | sed 's#https://ollama.ai/download#https://github.com/jmorganca/ollama/releases/download/v0.1.17#' | sh
```
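
If it completes successfully, `ollama -v` should then report the pinned version.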

@iplayfast commented on GitHub (Jan 9, 2024):

I think this is related to https://github.com/jmorganca/ollama/issues/1691

@IAMBUDE commented on GitHub (Jan 9, 2024):

I also experience this issue with 2x 3090 GPUs. The server just stops generating.

@jadhvank commented on GitHub (Jan 10, 2024):

I updated Ollama to version 0.1.19 and the hang happened again in 5 minutes.
Removed 0.1.19 and installed 0.1.16.
The hang occurred after 6 hours (better!)

@EmanueleLenzi92 commented on GitHub (Jan 17, 2024):

I think I have the same problem. After a few runs, the Ollama server crashes and stops generating text. I'm using Windows 11 (WSL Ubuntu) and Langchain. I have an RTX 4090 and I tried 0.1.16 through 0.1.19, but all of them have this issue in my case.
Instead, on a laptop with Windows 10 and an Nvidia T500, I don't have this problem.

@hml-github commented on GitHub (Jan 18, 2024):

Me too, same problem: generation stops after a random time.

@amirdeljouyi commented on GitHub (Jan 24, 2024):

Similarly, it halts after approximately 100 iterations.

@mchiang0610 commented on GitHub (Jan 27, 2024):

wanted to see if anyone is still running into this issue with ollama v0.1.22

@EmanueleLenzi92 commented on GitHub (Feb 2, 2024):

> wanted to see if anyone is still running into this issue with ollama v0.1.22

I confirm I still have this problem with 0.1.22

@julienlesbegueriesperso commented on GitHub (Feb 2, 2024):

> > wanted to see if anyone is still running into this issue with ollama v0.1.22
>
> I confirm I still have this problem with 0.1.22

I also confirm (on a MacBook Pro 2.6 GHz Intel Core i7 and on a CPU-only server).

@Simaky commented on GitHub (Feb 2, 2024):

I can confirm this issue with 0.1.23 (on WSL).
I ran the script with 100 requests and saw in the logs that 6/10 requests froze and never received a response :(

@svilupp commented on GitHub (Feb 8, 2024):

+1

I run a community leaderboard for Julia code generation and I've run 10s of thousands of samples in the past (with failures, but not unreasonable).
Recently, I've updated and haven't been able to run anything anymore... Same machine/setup

**Behavior:**

  • I updated to 0.1.2x (not sure) and I couldn't run more than a few samples of Qwen-72b FP16. It kept freezing, it would stop using the GPU, etc.
  • Updated to 0.1.23 -> cannot run more than 1-2 samples and it just hangs (need to kill the process)
  • Tried 0.1.24 pre-release -> same as above
  • Tried 0.1.19 -> better, I could run c. 100 samples and then it froze
  • I'm now on 0.1.16 and it's been a few hundred samples and it's still going. So clearly the issue didn't exist back then!

**Workload:**

  • This benchmark: https://github.com/svilupp/Julia-LLM-Leaderboard
  • Different models, prompts, test cases, but never have any empty or unexpected inputs (the workload hasn't changed, it just doesn't run anymore). I'm even running models that ran fine previously
  • Using ollama via /api/chat endpoint

**System:**

  • Debian-based x86
  • 4x RTX 4090

**Ollama header:**

```
2024/02/08 17:20:55 images.go:857: INFO total blobs: 140
2024/02/08 17:20:55 images.go:864: INFO total unused blobs removed: 0
2024/02/08 17:20:55 routes.go:950: INFO Listening on 127.0.0.1:11434 (version 0.1.22)
2024/02/08 17:20:55 payload_common.go:106: INFO Extracting dynamic libraries...
2024/02/08 17:20:57 payload_common.go:145: INFO Dynamic LLM libraries [rocm_v6 cpu rocm_v5 cuda_v11 cpu_avx2 cpu_avx]
2024/02/08 17:20:57 gpu.go:94: INFO Detecting GPU type
2024/02/08 17:20:57 gpu.go:236: INFO Searching for GPU management library libnvidia-ml.so
2024/02/08 17:20:57 gpu.go:282: INFO Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.86.10]
2024/02/08 17:20:57 gpu.go:99: INFO Nvidia GPU detected
2024/02/08 17:20:57 gpu.go:140: INFO CUDA Compute Capability detected: 8.9
```

@wac81 commented on GitHub (Feb 13, 2024):

> Hi @jadhvank sorry you hit this, looking into it
>
> In the meantime an easy way to install `0.1.17` is
>
> ```
> curl https://ollama.ai/install.sh | sed 's#https://ollama.ai/download#https://github.com/jmorganca/ollama/releases/download/v0.1.17#' | sh
> ```

Could it have anything to do with GPU memory management?

My experience is that if you use a 12 GB GPU to load a llama 13b model, the output basically gets stuck once it exceeds 200 tokens.

@jmorganca commented on GitHub (Feb 20, 2024):

This should be fixed as of 0.1.24. Please let me know if that isn't the case, and we'll re-open this (and get it fixed once and for all 😊). Sorry about this!

@StrikerRUS commented on GitHub (Feb 20, 2024):

@jmorganca Unfortunately, it isn't fixed in 0.1.25.

OS: Ubuntu 22.04.2 LTS
GPU: NVIDIA RTX A6000 (Driver Version: 530.41.03, CUDA Version: 12.1)
Model: Tested `mixtral:8x7b-instruct-v0.1-q4_K_M`, `mixtral:8x7b-instruct-v0.1-q6_K`, `llama2:7b-chat-q4_0`
Env: Official Docker

`/api/generate` and `/api/chat` hang completely, while the version and tags endpoints keep working.
Even `docker compose restart` doesn't help; only a complete `down` + `up` helps.

Observed this behavior sometimes with 0.1.23, but 0.1.25 makes things even worse: it hangs approximately every hour.
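
For anyone triaging, a hedged sketch of that symptom check with curl, assuming the default port and one of the models above; the metadata endpoints keep answering while generation does not:

```
# Metadata endpoints still respond while the server is "stuck":
curl -s --max-time 5 http://localhost:11434/api/version
curl -s --max-time 5 http://localhost:11434/api/tags

# A generation request, by contrast, never returns and only hits the timeout:
curl -s --max-time 60 http://localhost:11434/api/generate \
  -d '{"model": "llama2:7b-chat-q4_0", "prompt": "ping", "stream": false}'
```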

@calebdel commented on GitHub (Feb 21, 2024):

@jmorganca Likewise, I'm still seeing this issue after a small number of iterations on v0.1.25.

@EmanueleLenzi92 commented on GitHub (Feb 22, 2024):

> I think I have the same problem. After a few runs, the Ollama server crashes and stops generating text. I'm using Windows 11 (WSL Ubuntu) and Langchain. I have an RTX 4090 and I tried 0.1.16 through 0.1.19, but all of them have this issue in my case. Instead, on a laptop with Windows 10 and an Nvidia T500, I don't have this problem.

I confirm this problem with 0.1.25 and 0.1.26

@StrikerRUS commented on GitHub (Feb 23, 2024):

@jmorganca Can you please reopen this issue?

@BEpresent commented on GitHub (Feb 23, 2024):

Same here, issue still persists on fresh install (calling multiple times in a loop).

@ArjonBu commented on GitHub (Feb 24, 2024):

I am seeing this with 0.1.27 running in Docker on Linux. Docker has a limit of 8 GB of RAM, but the container is using only 1 GB.

The container just hangs and shows nothing in logs. I am using open-webui as a frontend.

@julienlesbegueriesperso commented on GitHub (Feb 25, 2024):

I also confirm on 0.1.27 on Mac OS X, Fedora with GPU (RTX), and Ubuntu (without GPU). In a FastAPI + Langchain env with 2 endpoints invoking 2 different Ollama models, after I succeed in receiving responses from the first endpoint, I'm stuck when I try the 2nd endpoint. I have to restart the Ollama service to get my response.
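
A hedged sketch of that two-model alternation with plain curl (model names here are placeholders for the two models behind the endpoints):

```
# The first model responds fine...
curl -s http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Say hello.", "stream": false}'
# ...switching to a second model is where the reported hang appears.
curl -s --max-time 120 http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Say hello.", "stream": false}'
```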

@ytlai1985 commented on GitHub (Feb 27, 2024):

I confirm that this problem occurs with versions 0.1.24 and 0.1.27. After adding a prompt about limiting the output, it seems to be resolved.
Does that mean no [EOS] token was generated? Using the 'stop' option will also resolve this problem, but sometimes it may not achieve the ideal result.

OS: Ubuntu 22.04.2 LTS
GPU: NVIDIA L4 (Driver Version: 535.154.05, CUDA Version: 12.2)
Model: Mixtral8x7b-instruct-v0.1-q5_K_M

![image](https://github.com/ollama/ollama/assets/105573237/0fdbf3a7-7dd7-4edf-9831-6d3cba76c4d3)

For example:
Limitation prompt:

    text_to_gpt = (
        f'[INST] You are a helpful assistant. Your task is to read the following context. '
        f'You will respond with a JSON object containing the entire sentence only if it contains abnormal words, '
        f'along with the confidence score. '
        f'Display only five objects with a high confidence score (greater than 0.6). '    # Adding limitation
        f"Abnormal words are defined as 'error', 'unknown', 'fail', 'alert', etc. Do not provide explanations. "
        f'context """\n{input_text}\n"""\n'
    )

Use options

        response = self.client.chat(
            model=self.model,
            messages=[
                {
                    'role': 'user',
                    'content': input_text,
                },
            ],
            options={
                'top_k': 10,
                'temperature': 0.6,
                'stop': ['\n']
            }
        )
@dhiltgen commented on GitHub (Feb 28, 2024):

Has anyone come up with a minimal repro with curl or equivalent? I'll try to repro and get to the bottom of this.
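
In case it helps anyone answer this, a minimal repro sketch with curl, assuming a locally pulled model; the model name and iteration count are placeholders:

```
# Loop /api/generate until a request hangs; --max-time turns a hang into a
# visible curl failure (exit code 28), and -f fails on HTTP errors.
for i in $(seq 1 500); do
  echo "request $i"
  curl -sf --max-time 120 http://localhost:11434/api/generate \
    -d '{"model": "mistral", "prompt": "Write one sentence about the sky.", "stream": false}' \
    > /dev/null || { echo "request $i failed or timed out"; break; }
done
```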

@wizardsd commented on GitHub (Mar 1, 2024):

I confirm this problem with 0.1.27 on Windows 10 without WSL. Might it be related to format=json and stream=false?
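
A hedged way to test that theory, assuming a pulled model: run the same prompt with and without the suspect flags and see which one stops coming back.

```
# Baseline: streaming, no JSON enforcement.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "List three colors."}'

# Suspect combination: format=json together with stream=false.
curl -s --max-time 120 http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "List three colors as JSON.", "format": "json", "stream": false}'
```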

@koleshjr commented on GitHub (Mar 4, 2024):

Could someone help us? This issue still persists:
I have updated to the latest release version, v0.1.28, and it still gets stuck after around 200 iterations on Google Colab free tier (T4).

@eusthace811 commented on GitHub (Mar 5, 2024):

In my case, it becomes unresponsive right after the initial interaction.

1 GPU (Nvidia A4500), 13b q4_K_M model

@giedriusrflt commented on GitHub (Mar 6, 2024):

Gets stuck also with:

```bash
$ ollama -v
ollama version is 0.1.28
```

@jonomillin commented on GitHub (Mar 6, 2024):

I'm getting the same thing with 0.1.26, 0.1.27, 0.1.28 on an M2 Max (64 GB RAM).
This happens both in the CLI (`ollama run llava`) and via the Python APIs (chat and generate). It will work fine on one or two images, then stall out. There is no crash; it just stops streaming new tokens and hangs (a sketch of the call pattern follows the logs below).

Server logs are as follows for a sample run via Python:

```
time=2024-03-07T07:46:10.314+09:00 level=INFO source=images.go:710 msg="total blobs: 14"
time=2024-03-07T07:46:10.315+09:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-07T07:46:10.316+09:00 level=INFO source=routes.go:1021 msg="Listening on 127.0.0.1:11434 (version 0.1.28)"
time=2024-03-07T07:46:10.316+09:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-07T07:46:10.338+09:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [metal]"
[GIN] 2024/03/07 - 07:46:10 | 200 |      45.458µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/03/07 - 07:46:10 | 200 |       578.5µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/03/07 - 07:46:10 | 200 |     373.083µs |       127.0.0.1 | POST     "/api/show"
time=2024-03-07T07:46:15.526+09:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/metal/libext_server.dylib"
time=2024-03-07T07:46:15.526+09:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
loading library /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/metal/libext_server.dylib
{"function":"load_model","level":"INFO","line":380,"msg":"Multi Modal Mode Enabled","tid":"0x171ef7000","timestamp":1709765175}
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423
ggml_metal_init: loading '/var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   595.50 MiB, (  596.44 / 49152.00)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    32.89 MiB, (  629.33 / 49152.00)
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /Users/jono/.ollama/models/blobs/sha256:170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = liuhaotian
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = liuhaotian
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  3847.58 MiB, ( 4476.91 / 49152.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:      Metal buffer size =  3847.57 MiB
................................
....................................
..............................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = /var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423
ggml_metal_init: loading '/var/folders/2t/ln_5pr392nd9dytmbrtvwhgw0000gp/T/ollama2139231423/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   256.00 MiB, ( 4733.59 / 49152.00)
llama_kv_cache_init:      Metal KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    13.02 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   164.02 MiB, ( 4897.61 / 49152.00)
llama_new_context_with_model:      Metal compute buffer size =   164.00 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.00 MiB
llama_new_context_with_model: graph splits (measure): 2
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    377
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 377 tensors from /Users/jono/.ollama/models/blobs/sha256:72d6f08a42f656d36b356dbe0920675899a99ce21192fd66266fb7d82ed07539
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                        clip.projector_type str              = mlp
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  18:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  235 tensors
clip_model_load: - type  f16:  142 tensors
clip_model_load: CLIP using Metal backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     595.49 MB
clip_model_load: metadata size:  0.14 MB
clip_model_load: params backend buffer size =  595.49 MB (377 tensors)
clip_model_load: compute allocated memory: 32.89 MB
{"function":"initialize","level":"INFO","line":433,"msg":"initializing slots","n_slots":1,"tid":"0x171ef7000","timestamp":1709765176}
{"function":"initialize","level":"INFO","line":445,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"0x171ef7000","timestamp":1709765176}
time=2024-03-07T07:46:16.208+09:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
{"function":"update_slots","level":"INFO","line":1565,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"0x17349f000","timestamp":1709765176}
[GIN] 2024/03/07 - 07:46:16 | 200 |  5.709566167s |       127.0.0.1 | POST     "/api/chat"
update check failed - TypeError: fetch failed
time=2024-03-07T07:46:41.119+09:00 level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
{"function":"launch_slot_with_data","level":"INFO","line":826,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"0x17349f000","timestamp":1709765201}
{"function":"update_slots","level":"INFO","line":1825,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"0x17349f000","timestamp":1709765201}
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in   223.26 ms by CLIP (    0.39 ms per image patch)
{"function":"print_timings","level":"INFO","line":264,"msg":"prompt eval time     =    5730.16 ms /     1 tokens ( 5730.16 ms per token,     0.17 tokens per second)","n_prompt_tokens_processed":1,"n_tokens_second":0.17451528815004544,"slot_id":0,"t_prompt_processing":5730.157,"t_token":5730.157,"task_id":0,"tid":"0x17349f000","timestamp":1709765223}
{"function":"print_timings","level":"INFO","line":278,"msg":"generation eval time =   16943.76 ms /   411 runs   (   41.23 ms per token,    24.26 tokens per second)","n_decoded":411,"n_tokens_second":24.25672181205148,"slot_id":0,"t_token":41.225686131386865,"t_token_generation":16943.757,"task_id":0,"tid":"0x17349f000","timestamp":1709765223}
{"function":"print_timings","level":"INFO","line":287,"msg":"          total time =   22673.91 ms","slot_id":0,"t_prompt_processing":5730.157,"t_token_generation":16943.757,"t_total":22673.914,"task_id":0,"tid":"0x17349f000","timestamp":1709765223}
{"function":"update_slots","level":"INFO","line":1635,"msg":"slot released","n_cache_tokens":412,"n_ctx":2048,"n_past":1971,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"0x17349f000","timestamp":1709765223,"truncated":false}
[GIN] 2024/03/07 - 07:47:03 | 200 | 22.698469125s |       127.0.0.1 | POST     "/api/chat"
time=2024-03-07T07:47:03.825+09:00 level=INFO source=dyn_ext_server.go:171 msg="loaded 1 images"
{"function":"launch_slot_with_data","level":"INFO","line":826,"msg":"slot is processing task","slot_id":0,"task_id":414,"tid":"0x17349f000","timestamp":1709765223}
{"function":"update_slots","level":"INFO","line":1825,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":414,"tid":"0x17349f000","timestamp":1709765223}

@jithinmukundan commented on GitHub (Mar 7, 2024):

I am facing the same issue after running it on GPU. I had no issues previously when running it only on CPU. Using 0.1.28 and LlamaIndex. Eagerly waiting for a solution.

@urinieto commented on GitHub (Mar 9, 2024):

Same issue here :(

I'm on version 0.1.28. It seems to stop working after ~100 to ~3000 queries in my Linux setup.

@ckehagioglou commented on GitHub (Mar 10, 2024):

> Has anyone come up with a minimal repro with curl or equivalent? I'll try to repro and get to the bottom of this.

Regretfully, I haven't managed to do so. Nevertheless, I went through the logs and noticed that when Ollama hangs, instead of the normal function sequence:
launch_slot_with_data: slot processing task -> update_slots: slot progression, kv cache -> print_timings: prompt, generation -> update_slots: slot release

it goes through the following:
launch_slot_with_data: slot processing task -> update_slots: slot progression, kv cache -> update_slots: slot context shift

The last function executes indefinitely until I stop the server and relaunch it. So it might be related to another issue I found (sorry, I haven't pinpointed the number) about infinite context shifting.

Hope the above provides a bit of assistance.
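
If the hang really is an endless context shift from a missing EOS token, a hedged mitigation (not a fix) would be to bound generation with the standard request options:

```
# num_predict caps generated tokens so a runaway generation can't shift the
# context forever; num_ctx pins the context window explicitly.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "mixtral", "prompt": "Summarize the report.", "stream": false,
       "options": {"num_predict": 512, "num_ctx": 4096}}'
```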

@harmanpreet93 commented on GitHub (Mar 14, 2024):

Facing a similar issue inside docker on Ubuntu 18.04 with ollama version 0.1.28 on Quadro RTX 5000.

@syrom commented on GitHub (Mar 14, 2024):

Same issue. Running ollama 0.1.28 on an M1 Max.
My observation:
I work on a large number of text chunks as input for a RAG algorithm, and the task for the LLM (Mixtral in my case) is to extract keywords and concepts from the chunks. The document is rather large, so it makes quite a difference whether I set the character count for the text split to produce chunks of 1,000 or 2,000 characters. These chunks are served to Mixtral as the USER_PROMPT, and the SYSTEM_PROMPT by itself is also rather long.
Now the key observation: the failure seems to be functionally dependent on the length of the overall prompt.
If I set the text split to 2,000 characters, the overall prompt length is much longer, and the failure occurs much sooner (5-10 generations) than if the text split is set to 1,000 characters (around 15-20 generations). Unfortunately, the algorithm ought to work its way through >400 to 800 chunks... which it doesn't.
Long story short: the occurrence of the bug seems to be a function of the number of tokens being served to the LLM through Ollama.

@niyogrv commented on GitHub (Mar 20, 2024):

I'm observing this issue in 0.1.28 on Ubuntu 22.04 with a 3060 (driver 535.161.07, CUDA 12.2) and 16GB RAM, running TheBloke's Q6 Mistral Instruct v0.2 GGUF.
I'm encountering this only when setting "format"="json". I am using the model for a classification task and only got through 5 queries before it hung and I had to restart ollama. I was able to reproduce this consistently, and it always failed at the 6th query.

I reran it, this time without the "format"="json" param, and I am 4k+ requests in without a crash.

UPDATE:
It crashed at around 5.7k requests. So, while the JSON format enforcement seems to accelerate the issue, it still happens if you're constantly bombarding the model with requests. Hopefully this gets fixed soon :(

```
[GIN] 2024/03/20 - 10:19:18 | 200 |  2.654764194s | 2406:8800:80:b281:35bb:b529:139e:a792 | POST     "/api/generate"
time=2024-03-20T12:09:40.398+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-20T12:09:40.398+05:30 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-20T12:09:40.398+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-20T12:09:40.398+05:30 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-20T12:09:40.398+05:30 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library /tmp/ollama1233141962/cuda_v11/libext_server.so
time=2024-03-20T12:09:40.399+05:30 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama1233141962/cuda_v11/libext_server.so"
time=2024-03-20T12:09:40.399+05:30 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /home/pcname/.ollama/models/blobs/sha256:a4643671c92f47eb6027d0eff50b9875562e8e172128a4b10b2be250bb4264de (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 18
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 5.53 GiB (6.56 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   102.54 MiB
llm_load_tensors:      CUDA0 buffer size =  5563.55 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    13.02 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   164.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     8.00 MiB
llama_new_context_with_model: graph splits (measure): 2
{"function":"initialize","level":"INFO","line":433,"msg":"initializing slots","n_slots":1,"tid":"125807536088640","timestamp":1710916782}
{"function":"initialize","level":"INFO","line":442,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"125807536088640","timestamp":1710916782}
time=2024-03-20T12:09:42.092+05:30 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
{"function":"update_slots","level":"INFO","line":1565,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"125806336538176","timestamp":1710916782}
{"function":"launch_slot_with_data","level":"INFO","line":823,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916782}
{"function":"update_slots","level":"INFO","line":1796,"msg":"slot progression","n_past":0,"n_prompt_tokens_processed":507,"slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916782}
{"function":"update_slots","level":"INFO","line":1821,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916782}
{"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =     696.16 ms /   507 tokens (    1.37 ms per token,   728.28 tokens per second)","n_prompt_tokens_processed":507,"n_tokens_second":728.2839934095325,"slot_id":0,"t_prompt_processing":696.157,"t_token":1.3730907297830375,"task_id":0,"tid":"125806336538176","timestamp":1710916784}
{"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =    1756.20 ms /    77 runs   (   22.81 ms per token,    43.84 tokens per second)","n_decoded":77,"n_tokens_second":43.8446396511561,"slot_id":0,"t_token":22.807805194805194,"t_token_generation":1756.201,"task_id":0,"tid":"125806336538176","timestamp":1710916784}
{"function":"print_timings","level":"INFO","line":281,"msg":"          total time =    2452.36 ms","slot_id":0,"t_prompt_processing":696.157,"t_token_generation":1756.201,"t_total":2452.358,"task_id":0,"tid":"125806336538176","timestamp":1710916784}
{"function":"update_slots","level":"INFO","line":1627,"msg":"slot released","n_cache_tokens":584,"n_ctx":2048,"n_past":583,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"125806336538176","timestamp":1710916784,"truncated":false}
[GIN] 2024/03/20 - 12:09:44 | 200 |  4.410814509s |   116.68.72.166 | POST     "/api/generate"
```
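
For anyone still looking for the minimal curl repro requested earlier in the thread, the request pattern above can be approximated with a plain loop. This is only a sketch: the host, model tag, and prompt are placeholders rather than the exact setup reported above.

```sh
#!/bin/sh
# Fire sequential /api/generate requests with "format": "json"; --max-time
# turns a server hang into a visible curl failure so the loop stops there.
for i in $(seq 1 10000); do
  curl -s --max-time 120 http://localhost:11434/api/generate -d '{
    "model": "mistral",
    "prompt": "Classify this review as positive or negative: I love it.",
    "format": "json",
    "stream": false
  }' > /dev/null || { echo "request $i failed or timed out"; break; }
done
```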


@dhiltgen commented on GitHub (Mar 20, 2024):

This will likely be resolved with #3218 but I'll leave this open until we can verify the health check logic is sufficient to catch this hang scenario.


@syrom commented on GitHub (Mar 31, 2024):

My first feedback after the last Ollama update: the situation has improved a lot, but the problem has not gone away altogether.
I tried it out on several text sizes - it now works on longer texts, but still eventually gets stuck on very long texts.
Before the last update, I would not get more than 30 generations in a row when feeding the algorithm text chunks of 1,000 characters. Now it works up to around 100 generations, and slightly north of that.
But processing a text consisting of 200 or more chunks, for example, still gets the process stuck eventually.


@omani commented on GitHub (Apr 5, 2024):

I have to stop Docker, rm the container, and run it again to work around this issue. I hope someone fixes this soon.


@omani commented on GitHub (Apr 5, 2024):

Here is an example of my local Ollama in Docker hallucinating:

````
 docker exec -it ollama ollama run starcoder2
>>> write a simple hello world program in golang
 and build an image from it.

This is the command that will be used:
```sh
docker run -it --name hello-go-app --rm -v ${PWD}:/go/src/hello-go-app golang:1.7 go run /go/src/hello-go-app/*.go

## Building the image

The first step is to create a Docker file that will contain the commands for building and running our application. The Dockerfile for our app will have two sections. We will use COPY and ENTRYPOINT for
this. Copy will copy the contents of our app folder into /go/src/hello^C
````

This happens with almost all models after some time, sometimes within minutes of cancelling and restarting the model.


@omani commented on GitHub (Apr 5, 2024):

What is this? Why is it happening with all my models? Does it have anything to do with Ollama?

```
 docker exec -it ollama ollama run starcoder2
>>> write a simple hello world program in golang.
 1.

Go is a compiled language. You can't compile Go programs with gcc or
clang on Linux. So the first thing you need to do is install a C compiler.
In this case, you will use GCC. If you don’t know how to do that, please consult
the previous post: Installing^C

>>> Send a message (/? for help)
```

@traddo commented on GitHub (Apr 7, 2024):

I ran into the same snag when I was working on summarizing text. Tweaking the prompt words sorted it out for me. I added a bit in the prompt to make sure the summary stays between 100 and 200 words.


@Mecil9 commented on GitHub (Apr 7, 2024):

The same issue!
![image](https://github.com/ollama/ollama/assets/2948534/38769b40-36fd-47ae-8d41-ecc801eaa92b)
My system: Apple M1 Max, 64GB.
At the initial run everything works fine. When questions are asked continuously, the system gets stuck: CPU usage keeps climbing while GPU usage drops to 0.
Once the CPU reaches 100%, Ollama stops working. I have tried many methods, to no avail!


@WithAnOrchid commented on GitHub (Apr 8, 2024):

> I ran into the same snag when I was working on summarizing text. Tweaking the prompt words sorted it out for me. I added a bit in the prompt to make sure the summary stays between 100 and 200 words.

This happened to me as well. After some research and testing, I found that setting the option `num_keep` to `0` fixed this issue.

Possibly related to #2805, #2225

Python code I used:

```
import json
import requests

def send_summarization_request(text, system_prompt):
    url = summarization_endpoint
    payload = {
        "model": model,
        "prompt": text,
        "system": system_prompt,
        "stream": False,
        "keep_alive": "5m",
        "options": {
            "num_keep": 0,
            "num_batch": 8
        }
    }
    response = requests.post(url, json=payload)
    summary = json.loads(response.text)

    response_text = summary["response"]

    return response_text
```

@ckehagioglou commented on GitHub (Apr 9, 2024):

> > I ran into the same snag when I was working on summarizing text. Tweaking the prompt words sorted it out for me. I added a bit in the prompt to make sure the summary stays between 100 and 200 words.
>
> This happened to me as well. After some research and testing, I found that setting the option `num_keep` to `0` fixed this issue.
>
> Possibly related to #2805, #2225
>
> Python code I used:
>
> ```
> def send_summarization_request(text, system_prompt):
>     url = summarization_endpoint
>     payload = {
>         "model": model,
>         "prompt": text,
>         "system": system_prompt,
>         "stream": False,
>         "keep_alive": "5m",
>         "options": {
>             "num_keep": 0,
>             "num_batch": 8
>         }
>     }
>     response = requests.post(url, json=payload)
>     summary = json.loads(response.text)
>
>     response_text = summary["response"]
>
>     return response_text
> ```

Looked all over the place to find out what num_keep does, but to no avail. All I found is that num_keep's default value is 0. Version 1.31 hangs even more often for me. The Ollama team is doing great work, but this bug is destroying the experience.

Working on a Mac Studio M2 Max 32GB, running many summarization tasks in sequence - if it helps.
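
For what it's worth, `num_keep` appears to map to llama.cpp's `n_keep`: the number of tokens from the start of the prompt that are kept in place when the context window fills and a context shift discards older tokens. A sketch of setting it per request (model and prompt are placeholders):

```sh
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Summarize the following text: ...",
  "stream": false,
  "options": { "num_keep": 0 }
}'
```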


@traddo commented on GitHub (Apr 9, 2024):

I added the num_keep parameter but the bug still exists. For now, I'm using a timeout to kill the ollama process as a workaround to complete the batch summarization tasks.
I use a bash script to start the `ollama serve &` process, checking every 2 seconds whether the ollama process exists and starting it if not.
When calling the API, I add a timeout limit of 1 minute. If a timeout exception occurs, I kill the ollama process, wait 2 seconds, remove the task that caused the timeout, and start the loop again.
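
A minimal sketch of the supervisor half of that workaround, assuming `ollama` is on the PATH (log path and intervals are illustrative; the 1-minute API timeout and task bookkeeping live in the calling client):

```sh
#!/bin/bash
# Restart ollama serve whenever the process disappears.
while true; do
  if ! pgrep -x ollama > /dev/null; then
    nohup ollama serve >> /tmp/ollama-serve.log 2>&1 &
    echo "restarted ollama serve at $(date)"
  fi
  sleep 2
done
```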


@abhinav-kashyap-asus commented on GitHub (Apr 9, 2024):

I also have this bug... :( Unfortunately, sometimes even restarting the Ollama server doesn't help; it just hangs.


@mrroll commented on GitHub (Apr 9, 2024):

I have the same experience. Adding the parameter does not prevent Ollama from getting stuck.


@danest commented on GitHub (Apr 9, 2024):

This happens to me too, so I wrote a bash script that manages it and just restarts it every 10 minutes:

```
start_ollama() {
    nohup ollama serve &  # Start ollama serve in the background and ignore hangups
    echo "Ollama serve started."
}

while true; do

    ollama_pid=$(pgrep -f 'ollama')

    if [ ! -z "$ollama_pid" ]; then
        echo "Killing ollama serve process: $ollama_pid"
        kill $ollama_pid
        sleep 2
    fi
    start_ollama

    sleep 600
done
```


@jdonaldson commented on GitHub (Apr 9, 2024):

Hitting the stability issue here as well. Had to add a reset action in my neovim so I could poke it awake more easily: https://github.com/jdonaldson/dotfiles/blob/main/.config/lvim/config.lua#L45



@dmitry-sablin-db commented on GitHub (Apr 12, 2024):

Got the same issue; it seems to be caused by using codellama 34b. That's not certain though, just from iterative checking.


@jtoy commented on GitHub (Apr 12, 2024):

Just to give more notes: I use Ollama on Mac and Linux. For me it seems more stable on Linux. On a Mac Studio M1 and a MacBook Pro M1, I have to restart it every dozen or so requests because it just freezes. I want to run this on my Mac Studio as a server, but it's too unstable. I am going to add a restart script every hour to see if that fixes it.


@javierrivarola commented on GitHub (Apr 15, 2024):

Same issue running on a MacBook Pro 16 M3 Max with 36 GB of RAM: Ollama hangs after an hour or so of usage, and the logs don't seem to indicate that anything went wrong. Seems that I'll need to use a cron job to restart it every hour.


@danomatika commented on GitHub (Apr 16, 2024):

We are seeing the same issue with Ubuntu 20.04 LTS and 2 x A100. So far I am taking a timeout-check-and-restart approach by running the following script, `ollama-check`, every 10 minutes with cron:

```sh
#! /bin/sh
# check if ollama api is not responding and restart service after timeout
# Dan Wilcox, ZKM | Hertzlab, zkm.de

# host
HOST=YOURSERVER:11434

# desired model
MODEL=llama2

# timeout in seconds
TIMEOUT=30

# make api call with timeout
curl --connect-timeout 5 --max-time $TIMEOUT -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"Why is the sky blue?\",
  \"format\": \"json\",
  \"stream\": false
}" http://${HOST}/api/generate 2> /dev/null

# force restart ollama on *any* non-zero exit code:
# https://everything.curl.dev/cmdline/exitcode.html
# ex. to check just timeout, use [ "$?" = "28" ]
if [ "$?" != "0" ] ; then
  killall ollama 2> /dev/null
  systemctl stop ollama 2> /dev/null
  systemctl start ollama
fi
```

This may not be the best solution, but we will try it for now.
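
For completeness, a crontab entry along these lines would run the check every 10 minutes (the install path is an assumption):

```sh
*/10 * * * * /usr/local/bin/ollama-check >> /var/log/ollama-check.log 2>&1
```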


@jossalgon commented on GitHub (Apr 16, 2024):

After applying this modification to the latest version, I have not had any more problems. If anyone else tries it and it works for them, I can open a PR.

```
---
 llm/server.go | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/llm/server.go b/llm/server.go
index 0e084d5..f32e25c 100644
--- a/llm/server.go
+++ b/llm/server.go
@@ -503,6 +503,11 @@ type CompletionResponse struct {
 }
 
 func (s *LlamaServer) Completion(ctx context.Context, req CompletionRequest, fn func(CompletionResponse)) error {
+	// Set a timeout for the request
+	var cancelFunc context.CancelFunc
+	ctx, cancelFunc = context.WithTimeout(ctx, 60*time.Second)
+	defer cancelFunc()
+
 	request := map[string]any{
 		"prompt":            req.Prompt,
 		"stream":            true,
---
```
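
For anyone who wants to try the patch without waiting for a PR, a rough build-from-source sequence (a sketch based on the repo's development docs of that era; the patch file name is hypothetical):

```sh
git clone https://github.com/ollama/ollama.git
cd ollama
git apply /path/to/completion-timeout.patch  # the diff above, saved to a file
go generate ./...                            # builds the bundled llama.cpp runners
go build .
./ollama serve
```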

@dhiltgen commented on GitHub (Apr 17, 2024):

Please give 0.1.32 a try and let us know if you're still seeing unrecoverable hangs.


@airbj31 commented on GitHub (Apr 17, 2024):

> Please give 0.1.32 a try and let us know if you're still seeing unrecoverable hangs.

I still have the same issue on both a Linux computer (Ubuntu 22.04 + RTX 4090) and a MacBook Pro (M3), but the tendency is reduced compared to the previous version (v0.1.30).


@calebdel commented on GitHub (Apr 17, 2024):

0.1.32 seems to have fixed the issue for me. 2000+ iterations so far without a hang. Previously 5-10 iterations would cause Ollama to hang.


@BruceMacD commented on GitHub (Apr 18, 2024):

Thanks to everyone for reporting and testing this. Marking this as resolved for now pending any more reports.


@kungfu-eric commented on GitHub (Apr 25, 2024):

Hangs after about 400 long-context requests on mixtral, and the same with llama3.

```
ollama --version
ollama version is 0.1.32
```

> Please give 0.1.32 a try and let us know if you're still seeing unrecoverable hangs.

While hung, the server keeps logging the following, but no response is ever returned to the client:

```
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056803}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056823}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":105393,"tid":"140517846056960","timestamp":1714056843}
```
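
The repeating context-shift lines suggest a generation that never reaches a stop token, so the 2048-token window keeps being shifted forever. Until the underlying hang is fixed, one possible mitigation is capping the response length per request with `num_predict`, so a runaway generation cannot shift the context indefinitely (model and limit are placeholders):

```sh
curl http://localhost:11434/api/generate -d '{
  "model": "mixtral",
  "prompt": "...",
  "stream": false,
  "options": { "num_predict": 1024 }
}'
```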

@entmike commented on GitHub (Apr 25, 2024):

Still having the problem here on version 0.1.32. I am running batches of image annotations with llava and it will just hang after a few dozen images or so.

RTX 4090
Ubuntu 22.04
Running via Docker container

Restarting the container kicks it back into submission, but I'm looking for a more reliable answer.


@kirill-vas commented on GitHub (Apr 26, 2024):

Also still experiencing hangs when calling the `/api/chat` endpoint. Running the HumanEval benchmark (164 samples), it usually fails after about 70-80 calls and requires an `ollama serve` restart to recover. It mostly happens with CodeLlama-70b rather than the smaller models (13b, 7b; only tested these).

Running v0.1.32 on Ubuntu 22.04.2 with an NVIDIA RTX A6000 (driver 530.30.02, CUDA 12.1), inside a Docker container.

The specific part where it seems to loop indefinitely is the `update_slots` function, with the `"msg":"slot context shift"` line from the `ollama serve` logs (full log below):

```
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984503}
```

Code that calls the endpoint:

```
import requests

url = "http://localhost:11434/api/chat"
params = {
    'model': 'codellama:70b-instruct',
    'messages': [
        {'role': 'system', 'content': sys_prompt},
        {'role': 'user', 'content': user_prompt},
    ],
    # generation options merged into a single dict (duplicate 'options'
    # keys would make Python silently drop 'num_ctx')
    'options': {
        'num_ctx': 4096,
        'seed': 123,
        'temperature': 0.2,
    },
    'stream': False,
    'keep_alive': 10
}
response = requests.post(url, json=params).json()
```

Full log of the run below:

```
time=2024-04-24T14:30:43.820-04:00 level=INFO source=gpu.go:121 msg="Detecting GPU type"
time=2024-04-24T14:30:43.820-04:00 level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-24T14:30:43.824-04:00 level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1496169119/runners/cuda_v11/libcudart.so.11.0]"
time=2024-04-24T14:30:44.105-04:00 level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
time=2024-04-24T14:30:44.105-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-24T14:30:44.387-04:00 level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
time=2024-04-24T14:30:44.437-04:00 level=INFO source=gpu.go:121 msg="Detecting GPU type"
time=2024-04-24T14:30:44.437-04:00 level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-24T14:30:44.440-04:00 level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1496169119/runners/cuda_v11/libcudart.so.11.0]"
time=2024-04-24T14:30:44.440-04:00 level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
time=2024-04-24T14:30:44.440-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-24T14:30:44.505-04:00 level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:127 msg="offload to gpu" reallayers=81 layers=81 required="38351.2 MiB" used="38351.2 MiB" available="45354.3 MiB" kv="640.0 MiB" fulloffload="324.0 MiB" partialoffload="348.0 MiB"
time=2024-04-24T14:30:44.536-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1496169119/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-1436d66b69757a245f02d000874c670507949d11ad5c188a623652052c6aa508 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --port 38491"
time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"139817808982016","timestamp":1713983444}
{"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"139817808982016","timestamp":1713983444}
{"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":32,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"139817808982016","timestamp":1713983444,"total_threads":64}
llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from /root/.ollama/models/blobs/sha256-1436d66b69757a245f02d000874c670507949d11ad5c188a623652052c6aa508 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 80
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_0:  561 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 260/32016 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 36.20 GiB (4.51 BPW)
llm_load_print_meta: general.name     = codellama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.55 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:        CPU buffer size =   140.70 MiB
llm_load_tensors:      CUDA0 buffer size = 36930.21 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   640.00 MiB
llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.15 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   324.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    20.01 MiB
llama_new_context_with_model: graph nodes  = 2566
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"139817808982016","timestamp":1713983453}
{"function":"initialize","level":"INFO","line":457,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"139817808982016","timestamp":1713983453}
{"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"139817808982016","timestamp":1713983453}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"63","port":"38491","tid":"139817808982016","timestamp":1713983453}
{"function":"update_slots","level":"INFO","line":1578,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"139817808982016","timestamp":1713983453}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7726,"tid":"139817808982016","timestamp":1713984223}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7727,"tid":"139817808982016","timestamp":1713984223}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7728,"tid":"139817808982016","timestamp":1713984223}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1809,"msg":"slot progression","n_past":4,"n_past_se":0,"n_prompt_tokens_processed":300,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223}
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":4,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984352}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984428}
{"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984503}
<!-- gh-comment-id:2079972244 --> @kirill-vas commented on GitHub (Apr 26, 2024): Also still experiencing hangs when calling `/api/chat` endpoint. Running on HumanEval benchmark (164 samples), usually fails about 70-80 calls. Requires `ollama serve` restart to recover. Mostly happens with CodeLlama-70b rather than the smaller models (13b, 7b; only tested these). Running v0.1.32 on Ubuntu 22.04.2 with NVIDIA RTX A6000, driver 530.30.02, CUDA 12.1, using a Docker container The specific part where it seems to loop indefinitely is the `update_slots` function with `"msg":"slot context shift"` line from `ollama serve` logs (full logs below): ``` {"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984503} ``` Code that calls the endpoint: ``` url = "http://localhost:11434/api/chat" params = { 'model': 'codellama:70b-instruct', 'options': { 'num_ctx': 4096 }, 'messages': [ {'role': 'system', 'content': sys_prompt}, {"role": "user", "content": user_prompt}, ], 'options': { 'seed': 123, 'temperature': 0.2, }, 'stream': False, 'keep_alive': 10 } response = requests.post(url, json=params).json() ``` Full log of the run below: ``` time=2024-04-24T14:30:43.820-04:00 level=INFO source=gpu.go:121 msg="Detecting GPU type" time=2024-04-24T14:30:43.820-04:00 level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*" time=2024-04-24T14:30:43.824-04:00 level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1496169119/runners/cuda_v11/libcudart.so.11.0]" time=2024-04-24T14:30:44.105-04:00 level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart" time=2024-04-24T14:30:44.105-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-24T14:30:44.387-04:00 level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6" time=2024-04-24T14:30:44.437-04:00 level=INFO source=gpu.go:121 msg="Detecting GPU type" time=2024-04-24T14:30:44.437-04:00 level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*" time=2024-04-24T14:30:44.440-04:00 level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1496169119/runners/cuda_v11/libcudart.so.11.0]" time=2024-04-24T14:30:44.440-04:00 level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart" time=2024-04-24T14:30:44.440-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-24T14:30:44.505-04:00 level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6" time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:127 msg="offload to gpu" reallayers=81 layers=81 required="38351.2 MiB" used="38351.2 MiB" available="45354.3 MiB" kv="640.0 MiB" fulloffload="324.0 MiB" partialoffload="348.0 MiB" time=2024-04-24T14:30:44.536-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1496169119/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-1436d66b69757a245f02d000874c670507949d11ad5c188a623652052c6aa508 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --port 38491" time=2024-04-24T14:30:44.536-04:00 level=INFO source=server.go:389 msg="waiting for llama runner to start responding" 
{"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"139817808982016","timestamp":1713983444} {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"139817808982016","timestamp":1713983444} {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":32,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"139817808982016","timestamp":1713983444,"total_threads":64} llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from /root/.ollama/models/blobs/sha256-1436d66b69757a245f02d000874c670507949d11ad5c188a623652052c6aa508 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = codellama llama_model_loader: - kv 2: llama.context_length u32 = 2048 llama_model_loader: - kv 3: llama.embedding_length u32 = 8192 llama_model_loader: - kv 4: llama.block_count u32 = 80 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 64 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 11: general.file_type u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["<unk>", "<s>", "</s>", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 161 tensors llama_model_loader: - type q4_0: 561 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 260/32016 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32016 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 2048 llm_load_print_meta: n_embd = 8192 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 80 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 28672 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 2048 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 70B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 68.98 B llm_load_print_meta: model size = 36.20 GiB (4.51 BPW) llm_load_print_meta: general.name = codellama llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 2 '</s>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: LF token = 13 '<0x0A>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes ggml_cuda_init: CUDA_USE_TENSOR_CORES: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.55 MiB llm_load_tensors: offloading 80 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 81/81 layers to GPU llm_load_tensors: CPU buffer size = 140.70 MiB llm_load_tensors: CUDA0 buffer size = 36930.21 MiB .................................................................................................... 
llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 640.00 MiB llama_new_context_with_model: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.15 MiB llama_new_context_with_model: CUDA0 compute buffer size = 324.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 20.01 MiB llama_new_context_with_model: graph nodes = 2566 llama_new_context_with_model: graph splits = 2 {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"139817808982016","timestamp":1713983453} {"function":"initialize","level":"INFO","line":457,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"139817808982016","timestamp":1713983453} {"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"139817808982016","timestamp":1713983453} {"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"63","port":"38491","tid":"139817808982016","timestamp":1713983453} {"function":"update_slots","level":"INFO","line":1578,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"139817808982016","timestamp":1713983453} {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7726,"tid":"139817808982016","timestamp":1713984223} {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223} {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7727,"tid":"139817808982016","timestamp":1713984223} {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223} {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223} {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":7728,"tid":"139817808982016","timestamp":1713984223} {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":40168,"status":200,"tid":"139814867263488","timestamp":1713984223} {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223} {"function":"update_slots","ga_i":0,"level":"INFO","line":1809,"msg":"slot progression","n_past":4,"n_past_se":0,"n_prompt_tokens_processed":300,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223} {"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":4,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984223} {"function":"update_slots","level":"INFO","line":1601,"msg":"slot context 
shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984352} {"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984428} {"function":"update_slots","level":"INFO","line":1601,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slot_id":0,"task_id":7729,"tid":"139817808982016","timestamp":1713984503} ```
Author
Owner

@omani commented on GitHub (Apr 26, 2024):

I don't understand the hurry to close this issue without getting enough feedback first. Where have you learned this, @BruceMacD? Or is this normal procedure in your dev workflow?

<!-- gh-comment-id:2080221685 --> @omani commented on GitHub (Apr 26, 2024): I dont understand the hurry to close this issue without getting enough feedback first. where have you learned this @BruceMacD ? or is this normal procedure in your dev workflow?
Author
Owner

@EmanueleLenzi92 commented on GitHub (Apr 27, 2024):

I think I have the same problem. After a few runs, the Ollama server crashes and stops generating text. I'm using Windows 11 (WSL Ubuntu) and LangChain. I have an RTX 4090 and I tried versions from 0.1.16 to 0.1.19, but all of them have this issue in my case. On a laptop with Windows 10 and an NVIDIA T500, however, I don't have this problem.

I still have this problem with version 0.1.32 with an RTX 4090 and Windows 11 (on WSL Ubuntu).
After a few runs (fewer than 10), the Ollama server is stuck and I can't access "localhost:11434/" anymore unless I kill the process.

<!-- gh-comment-id:2080419193 --> @EmanueleLenzi92 commented on GitHub (Apr 27, 2024): > I think I have the same problem. After a few runs, the ollama server crashes and stops to generate text. I'm using windows 11 (wsl ubuntu) and langchain. I have a rtx 4090 and I tried from 0.1.16 to 0.1.19, but all of them have this issue in my case. instead, on a laptop with windows 10 and with an nvidia T500, I don't have this problem. I still have this problem with 0.1.32 version with rtx 4090 and windows 11 (on wsl ubuntu). After a few run (less then 10), the Ollama server is stuck and i can't access anymore to "localhost:11434/" unless i kill the process
Author
Owner

@frederick-wang commented on GitHub (Apr 27, 2024):

Got the same bug with A100 on Ubuntu 22.04. ollama version is 0.1.32.

<!-- gh-comment-id:2081097477 --> @frederick-wang commented on GitHub (Apr 27, 2024): Got the same bug with A100 on Ubuntu 22.04. ollama version is 0.1.32.
Author
Owner

@frederick-wang commented on GitHub (Apr 27, 2024):

Thanks to everyone for reporting and testing this. Marking this as resolved for now pending any more reports.

Sorry bro @BruceMacD, I found that this issue has not been resolved. I encountered the same stuck issue yesterday (ollama 0.1.32, A100, Ubuntu 22.04) and had to restart to resolve it.

<!-- gh-comment-id:2081099776 --> @frederick-wang commented on GitHub (Apr 27, 2024): > Thanks to everyone for reporting and testing this. Marking this as resolved for now pending any more reports. ‌‌‌Sorry bro @BruceMacD, I found that this issue has not been resolved. I encountered the same stuck issue yesterday (ollama 0.1.32, A100, Ubuntu 22.04) and had to restart to resolve it.
Author
Owner

@ckehagioglou commented on GitHub (Apr 28, 2024):

Same bug here. My Mac Studio (M2 Max) hangs after several questions are asked.

<!-- gh-comment-id:2081335874 --> @ckehagioglou commented on GitHub (Apr 28, 2024): Same bug here. Mac M2 Max Studio hangs after several questions being asked.
Author
Owner

@BruceMacD commented on GitHub (Apr 28, 2024):

Thanks for the reports, re-opening this.

Couple of questions to help me reproduce:

  • What models are people seeing this on?
  • Are you inputting long prompts/context when it gets stuck?
<!-- gh-comment-id:2081336779 --> @BruceMacD commented on GitHub (Apr 28, 2024): Thanks for the reports, re-opening this. Couple of questions to help me reproduce: - What models are people seeing this on? - Are you inputting long prompts/context when it gets stuck?
Author
Owner

@airbj31 commented on GitHub (Apr 28, 2024):

Thanks for the reports, re-opening this.

Couple of questions to help me reproduce:

  • What models are people seeing this on?
  • Are you inputting long prompts/context when it gets stuck?
  1. mistral, mistral:instruct, llama3
  2. I usually use the LLM model to summarize text. The inputs are normally in the range of 50 to 500 words.
<!-- gh-comment-id:2081340805 --> @airbj31 commented on GitHub (Apr 28, 2024): > Thanks for the reports, re-opening this. > > Couple of questions to help me reproduce: > > * What models are people seeing this on? > * Are you inputting long prompts/context when it gets stuck? 1. mistral, mistral:instruct, llama3 2. I usually use the LLM model to summarize text. The inputs are normally less than 50 ~500 words.
Author
Owner

@EmanueleLenzi92 commented on GitHub (Apr 28, 2024):

Thanks for the reports, re-opening this.

Couple of questions to help me reproduce:

  • What models are people seeing this on?
  • Are you inputting long prompts/context when it gets stuck?
  1. llama2 7b and 13b, llama3 8b
  2. I use about 300 words in the prompts
<!-- gh-comment-id:2081389108 --> @EmanueleLenzi92 commented on GitHub (Apr 28, 2024): > Thanks for the reports, re-opening this. > > Couple of questions to help me reproduce: > > * What models are people seeing this on? > * Are you inputting long prompts/context when it gets stuck? 1. llama2 7b and 13b, llama3 8b 2. I use about 300 words in the prompts
Author
Owner

@dhiltgen commented on GitHub (Apr 28, 2024):

The pre-release for 0.1.33 is available now, which should resolve these long context hang/loop problems.

<!-- gh-comment-id:2081585834 --> @dhiltgen commented on GitHub (Apr 28, 2024): The pre-release for [0.1.33](https://github.com/ollama/ollama/releases) is available now, which should resolve these long context hang/loop problems.
Author
Owner

@syrom commented on GitHub (Apr 28, 2024):

@dhiltgen Great news, thank you: will try ASAP once I have the update installed.
FYI, the situation has already improved considerably, but hangups are still there with 0.1.32.
I experienced a hangup after having Ollama / Mixtral churn through a large text file for > 12 h, extracting semantic information from it.
Setup: M1 Powerbook with 64 GB RAM and Ollama 0.1.32.
The text had 623 chunks of 1,000 characters each (plus a prompt of roughly 400 characters), and the hangup occurred after processing 517 of these chunks.

<!-- gh-comment-id:2081625117 --> @syrom commented on GitHub (Apr 28, 2024): @dhiltgen Great news, thank you: will try asap after I have the update installed. FYI, the situation has alread improved considerably - but hangups still there are with 0.1.32. I experienced a hangup after having Ollama / Mixtral churn thru a large text file for > 12 h, extracting semanting information from it. Setup: M1 Powerbook with 64 GB RAM and Ollama 0.1.32. The text had 623 chunks with 1000 characters each (plus another ca. 400 characters prompt size) - and the hangup occured after processing 517 of these chunks.
Author
Owner

@WeirdCarrotMonster commented on GitHub (Apr 28, 2024):

I still encounter this problem on 0.1.33: ollama gets stuck after 15 minutes of embeddings processing (using nomic-embed-text). Last log lines:

ollama[8937]: [GIN] 2024/04/28 - 20:12:17 | 200 |  155.784253ms |  100.112.67.113 | POST     "/api/embeddings"
ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31361,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41614,"status":200,"tid":"140301227372544","timestamp":1714335137}
ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31362,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41614,"status":200,"tid":"140301227372544","timestamp":1714335137}
ollama[9198]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"update_slots","level":"INFO","line":1640,"msg":"slot released","n_cache_tokens":103,"n_ctx":2048,"n_past":103,"n_system_tokens":0,"slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137,"truncated":false}
ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/embedding","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137}
ollama[8937]: [GIN] 2024/04/28 - 20:12:17 | 200 |   49.744207ms |  100.112.67.113 | POST     "/api/embeddings"
ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31366,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137}
ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31367,"tid":"140302648860672","timestamp":1714335137}
ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137}
ollama[9198]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":31368,"tid":"140302648860672","timestamp":1714335137}

GPU: NVIDIA GeForce RTX 3060
Driver version: 550.54.14
CUDA Version: 12.4

<!-- gh-comment-id:2081640588 --> @WeirdCarrotMonster commented on GitHub (Apr 28, 2024): I can still encounter this problem on 0.1.33: ollama gets stuck after 15 minutes of embeddings processing (using nomic-embed-text). Last log lines: ``` ollama[8937]: [GIN] 2024/04/28 - 20:12:17 | 200 | 155.784253ms | 100.112.67.113 | POST "/api/embeddings" ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31361,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41614,"status":200,"tid":"140301227372544","timestamp":1714335137} ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31362,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41614,"status":200,"tid":"140301227372544","timestamp":1714335137} ollama[9198]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"update_slots","level":"INFO","line":1640,"msg":"slot released","n_cache_tokens":103,"n_ctx":2048,"n_past":103,"n_system_tokens":0,"slot_id":0,"task_id":31363,"tid":"140302648860672","timestamp":1714335137,"truncated":false} ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/embedding","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137} ollama[8937]: [GIN] 2024/04/28 - 20:12:17 | 200 | 49.744207ms | 100.112.67.113 | POST "/api/embeddings" ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31366,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137} ollama[9198]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":31367,"tid":"140302648860672","timestamp":1714335137} ollama[9198]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":41618,"status":200,"tid":"140301247180800","timestamp":1714335137} ollama[9198]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":31368,"tid":"140302648860672","timestamp":1714335137} ``` GPU: NVIDIA GeForce RTX 3060 Driver version: 550.54.14 CUDA Version: 12.4
Author
Owner

@janis-inzpire commented on GitHub (May 3, 2024):

Just to add to this a bit: it looks like we are experiencing the same issue.
Running the llava model, it gets stuck every 15 to 20 minutes; sometimes it gets stuck after just 4 requests.
We are using the API to call the endpoint, running version 0.1.33 through a Docker container.

<!-- gh-comment-id:2092408454 --> @janis-inzpire commented on GitHub (May 3, 2024): Just to add to this a bit - looks like we are experiencing same issue. Running llava model, gets stuck every 15 - 20 minutes. Sometimes gets stuck after just 4 requests. We are using API to call the endpoint. running version 0.1.33 though a docker container.
Author
Owner

@ukrolelo commented on GitHub (May 5, 2024):

+1 stuck on a question in a different language.

You 05/05/2024 18:18
so do you understand Ukrainian?



profile
Llama3:Latest 
A useful question!

I can understand basic phrases and words in Ukrainian, since I was trained on a large corpus of texts in various languages, including Ukrainian.


However, my proficiency in Ukrainian may be limited compared to that of a native Ukrainian correspondent. I may make mistakes in grammar, morph

remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939}
{"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":832,"tid":"139900887961600","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939}
{"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":833,"tid":"139900887961600","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939}
{"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":834,"tid":"139900887961600","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939}
{"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":835,"tid":"139900887961600","timestamp":1714925939}
{"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1812,"msg":"slot progression","n_past":1085,"n_past_se":0,"n_prompt_tokens_processed":307,"slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939}
{"function":"update_slots","level":"INFO","line":1839,"msg":"kv cache rm [p0, end)","p0":1085,"slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939}
<!-- gh-comment-id:2094866527 --> @ukrolelo commented on GitHub (May 5, 2024): +1 stuck on question in different language. ``` You 05/05/2024 18:18 то ти розуміеш украйську мову? profile Llama3:Latest Корисно запитання! Я можу зрозуміти основні фрази і слова на українській мові, оскільки я вивчав великий корпус текстів різних мов, включно з українською. Однак, моя комп'ютерна профікувність українській мові може бути обмеженою порівняно з профікувністю кореспондента-українця. Я можу зробити помилки в граматиці, морф ``` ``` remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939} {"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":832,"tid":"139900887961600","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47034,"status":200,"tid":"139900095553536","timestamp":1714925939} {"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":833,"tid":"139900887961600","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939} {"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":834,"tid":"139900887961600","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939} {"function":"process_single_task","level":"INFO","line":1509,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":835,"tid":"139900887961600","timestamp":1714925939} {"function":"log_server_request","level":"INFO","line":2737,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":47036,"status":200,"tid":"139900087160832","timestamp":1714925939} {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939} {"function":"update_slots","ga_i":0,"level":"INFO","line":1812,"msg":"slot progression","n_past":1085,"n_past_se":0,"n_prompt_tokens_processed":307,"slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939} 
{"function":"update_slots","level":"INFO","line":1839,"msg":"kv cache rm [p0, end)","p0":1085,"slot_id":0,"task_id":836,"tid":"139900887961600","timestamp":1714925939} ```
Author
Owner

@syrom commented on GitHub (May 6, 2024):

A quick feedback: from my perspective, the bug is solved as far as Ollama running on Apple Silicon is concerned. I was never able to process more than ca. 120 text chunks of 1,000 characters in one go on an M1 Pro Mac. Now, with the update to 0.1.33, the computer ran for 24 h nonstop, processing ca. 630 text chunks of a larger document to extract information from it... and did so to the very end.
Simply: thanks!

<!-- gh-comment-id:2096758682 --> @syrom commented on GitHub (May 6, 2024): A quick feedback: from my perspective, the bug is solved as far as Ollama running on Mac Silicon is concerned. I was never able to to process more than ca. 120 text chunks of a size of 1.000 characters in one go on an M1 Pro Mac. Now, with the upate to 0.1.33, the computer ran for 24 h nonstop, processing ca. 630 text chunks of a larger document to extract information from it.... and did so to the very end. Simply: Thanks !
Author
Owner

@maciejmajek commented on GitHub (May 9, 2024):

Still happens to me with llava models @ ollama v0.1.34
Interestingly, Ollama only freezes up when I use the /chat endpoint with both image and text data. It works fine when only text is sent.
I've noticed that the problem gets worse when I hit the /chat endpoint with multiple prompts at once using Ollama's queuing system. It tends to hang after about 30 seconds...

Setup:
2x RTX 4090
13900k

logs:

Last successful chat call

[GIN] 2024/05/09 - 18:36:57 | 200 | 8.140684188s | 10.244.163.252 | POST "/api/chat"
time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:294 msg="context for request finished"
time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:232 msg="runner with non-zero duration has gone idle, adding timer" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a duration=5m0s
time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:248 msg="after processing request finished event" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a refCount=0
time=2024-05-09T18:36:59.457+02:00 level=DEBUG source=sched.go:435 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":850,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":851,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":852,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619}
time=2024-05-09T18:36:59.594+02:00 level=DEBUG source=prompt.go:172 msg="prompt now fits in context window" required=1988 window=2048
time=2024-05-09T18:36:59.595+02:00 level=DEBUG source=routes.go:1241 msg="chat handler" prompt="<|im_start|>system\n<|im_end|>\n<|im_start|>user\n- blah blah <|im_start|>system\n<|im_end|>\n<|im_start|>user\n[img-0] [img-1] input: two consecutive images blah blah <|im_end|>\n<|im_start|>assistant\n" images=2
time=2024-05-09T18:36:59.595+02:00 level=DEBUG source=server.go:591 msg="setting token limit to 10x num_ctx" num_ctx=2048 num_predict=20480
{"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":853,"tid":"140043376336896","timestamp":1715272619}
{"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49272,"status":200,"tid":"140038952120320","timestamp":1715272619}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":854,"tid":"140043376336896","timestamp":1715272619}
{"function":"update_slots","level":"INFO","line":1837,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":854,"tid":"140043376336896","timestamp":1715272619}

<!-- gh-comment-id:2102998725 --> @maciejmajek commented on GitHub (May 9, 2024): Still happens to me with llava models @ ollama v0.1.34 Interestingly, Ollama only freezes up when I use the /chat endpoint with both image and text data. It works fine when only text is sent. I've noticed that the problem gets worse when I hit the /chat endpoint with multiple prompts at once using Ollama's queuing system. It tends to hang after about 30 seconds... Setup: 2x RTX 4090 13900k logs: <details><summary>Last succesful chat call </summary> <p> [GIN] 2024/05/09 - 18:36:57 | 200 | 8.140684188s | 10.244.163.252 | POST "/api/chat" time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:294 msg="context for request finished" time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:232 msg="runner with non-zero duration has gone idle, adding timer" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a duration=5m0s time=2024-05-09T18:36:57.971+02:00 level=DEBUG source=sched.go:248 msg="after processing request finished event" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a refCount=0 time=2024-05-09T18:36:59.457+02:00 level=DEBUG source=sched.go:435 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-1834da0de12e8d8c4cce928b0020f25311d5fca5ae77be8fc9039f8bcda1833a {"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":850,"tid":"140043376336896","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619} {"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":851,"tid":"140043376336896","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619} {"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":852,"tid":"140043376336896","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":49266,"status":200,"tid":"140039010869248","timestamp":1715272619} time=2024-05-09T18:36:59.594+02:00 level=DEBUG source=prompt.go:172 msg="prompt now fits in context window" required=1988 window=2048 time=2024-05-09T18:36:59.595+02:00 level=DEBUG source=routes.go:1241 msg="chat handler" prompt="<|im_start|>system\n<|im_end|>\n<|im_start|>user\n- blah blah <|im_start|>system\n<|im_end|>\n<|im_start|>user\n[img-0] [img-1] input: two consecutive images blah blah 
<|im_end|>\n<|im_start|>assistant\n" images=2 time=2024-05-09T18:36:59.595+02:00 level=DEBUG source=server.go:591 msg="setting token limit to 10x num_ctx" num_ctx=2048 num_predict=20480 {"function":"process_single_task","level":"INFO","line":1507,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":853,"tid":"140043376336896","timestamp":1715272619} {"function":"log_server_request","level":"INFO","line":2735,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":49272,"status":200,"tid":"140038952120320","timestamp":1715272619} {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":854,"tid":"140043376336896","timestamp":1715272619} {"function":"update_slots","level":"INFO","line":1837,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":854,"tid":"140043376336896","timestamp":1715272619} </p> </details>
Author
Owner

@mironnn commented on GitHub (May 16, 2024):

The same: still have issues.
ollama version 0.1.38
RTX A6000
Llama3:70b
Hangs in fewer than 10 requests

<!-- gh-comment-id:2115411169 --> @mironnn commented on GitHub (May 16, 2024): The same. still, have issues ollama version 0.1.38 RTXA6000 Llama3:70b Hangs on <10 requests
Author
Owner

@quaintdev commented on GitHub (May 23, 2024):

Has happened with multiple models for me. My prompt is usually just a line. I have seen that it happens if I keep it idle for some time; when I come back, the responses are stuck. I am running on CPU. The first prompt after starting the ollama serve command always gets a quick response. I am on 0.1.37

Edit: Happens on 0.1.38 too. I always see something like below when this happens

image

With easyllama I don't see such an issue.

<!-- gh-comment-id:2127762023 --> @quaintdev commented on GitHub (May 23, 2024): Has happened with multiple models for me. My prompt is usually just a line. I have seen that it happens if I kept it idle for sometime. When I come back the responses are stuck. I am running on CPU. The first prompt after starting `ollama serve` command always gets quick response. I am on `0.1.37` Edit: Happens on `0.1.38` too. I always see something like below when this happens ![image](https://github.com/ollama/ollama/assets/59229571/2505451e-2b71-4326-b3a6-9e83b0470a8b) With easyllama and I don't see such issue.
Author
Owner

@sammcj commented on GitHub (May 26, 2024):

Hi all, give this fix a go: https://github.com/ollama/ollama/issues/4604#issuecomment-2130436000

<!-- gh-comment-id:2132039238 --> @sammcj commented on GitHub (May 26, 2024): Hi all, give this fix a go: https://github.com/ollama/ollama/issues/4604#issuecomment-2130436000
Author
Owner

@quaintdev commented on GitHub (May 26, 2024):

I'm not using Docker so I don't think this fix is applicable to me.

<!-- gh-comment-id:2132071135 --> @quaintdev commented on GitHub (May 26, 2024): I'm not using Docker so I don't think this fix is applicable to me.
Author
Owner

@jak4 commented on GitHub (May 27, 2024):

I'm experiencing a similar issue. I'm running a virtualized VM with a Tesla P40. After booting the VM everything works, but after a while, when the server idles, it stops working. Neither requests from a frontend nor from the CLI (e.g. ollama run llama3) work; with the CLI it just never starts up. The log files don't show anything suspicious, and "service ollama restart" does nothing. The only thing that is maybe not aligned is that I have CUDA 12.2 installed but the runner is using v11.
Edit: Version 0.1.38

Edit: Also happens on version 0.1.39. What is maybe interesting is that this happens regardless of whether any queries are run against the LLM. After booting the VM and not running any query for an unspecified amount of time, but less than 2 h, ollama becomes/is unresponsive. It seems the model gets loaded but doesn't finish: after loading the model up to a certain point, with Llama 3 at around 4800 MiB of GPU RAM, the loading slows to a crawl and the GPU RAM usage increases by about 2 MiB every few seconds (every two seconds?). At some point it increases by 6 MiB every few seconds (at 4922 MiB), and then stops completely (at 4934 MiB). After a while the process stops and the GPU RAM is completely empty again.

When comparing a working ollama instance to a non-responsive instance, the load speed for the model is way higher when everything works. The model I used for this testing uses 4934 MiB when fully loaded, which tracks with the number above.

<!-- gh-comment-id:2133875698 --> @jak4 commented on GitHub (May 27, 2024): I'm experiencing a similar issue. I'm running on a virtualized VM with a Tesla P40. After booting the VM everything works, but after a while, when the server idles, it stops working. Neither requests from a frontend nor from the cli (e.g. ollama run llama3) work. With the cli it just never starts up. The log files dont show anything suspicous. "service ollama restart" does nothing. The only thing that is maybe not aligned is, that I'm having CUDA 12.2 installed but the runner is using v11. Edit: Version 0.1.38 Edit: Also happens on Version 0.1.39. What is maybe interessting is, that this happens regardless of running queries against the LLM or not. After booting the VM and not running any query for an unspecified amount of time, but less than 2 h, ollama becomes/is unresponsive. It seems the model gets loaded, but doesnt finish. After loading the model up to a certain point, with LLama 3 to around 4800 MiB of GPU RAM, the loading slows to a crawl and the GPU RAM usage increases at 2 MiB for every few (two seconds?). At some point it increases with 6 MiB per every few seconds (at 4922 MiB), and then stops completly (at 4934 MiB). After a while the process stops and the GPU RAM is completly empty again. When comparing a working ollama instance to a non-responsive instance the load speed for the model is way higher when everything works out. The model I used for this testing uses 4934 MiB when fully loaded. Which tracks with the number above.
Author
Owner

@blubbsy commented on GitHub (May 28, 2024):

Thanks for the reports, re-opening this.

Couple of questions to help me reproduce:

* What models are people seeing this on?

* Are you inputting long prompts/context when it gets stuck?

I'm seeing the same problem on Windows. I'm using llava:v1.6 and pass the images through bind(...) as base64, then invoke the prompt. It works fine for a few prompts and then stops.
I'm currently checking whether something else could be wrong, but as my experience fits what I read here, I wanted to mention it.

<!-- gh-comment-id:2136128609 --> @blubbsy commented on GitHub (May 28, 2024): > Thanks for the reports, re-opening this. > > Couple of questions to help me reproduce: > > * What models are people seeing this on? > > * Are you inputting long prompts/context when it gets stuck? i'm seeing the same problem on Windows. I'm using llava:v1.6 and pass the images through bind(...) as base64 and then invoke the promp. works fine for few prompts and then stops. i'm at the moment checking if maybe something else could be wrong, but as my experience fits to what i read here i want to mention it.
Author
Owner

@jak4 commented on GitHub (May 31, 2024):

I'm seeing the same issues with vLLM, which indicates a problem with some underlying library, e.g. torch or maybe something CUDA-related. What is fascinating is that this apparently has nothing to do with time between requests or even "going into sleep mode", since I managed to perform a query to vLLM that was working perfectly at around 15 tokens per second and then slowed to a crawl at 0.1 tokens per second. So this happens even while generation is running.

<!-- gh-comment-id:2142423524 --> @jak4 commented on GitHub (May 31, 2024): I'm seeing the same issues with vLLM which indicates a problem with some underlying libraries, e.g. torch or maybe something CUDA. What is fascinating is, that this has apparently nothing to do with time between requests or even "going into sleep mode" since I managed to perform a query to vLLM which was working perfectly with around 15 tokens per second and then slowed to a crawl with 0.1 tokens per second. So even while the generation is running this stuff happens.
Author
Owner

@jak4 commented on GitHub (Jun 1, 2024):

I have resolved my issue. It had nothing to do with ollama, vLLM, or any other part of the software stack. It was a LICENSING issue: I simply forgot to acquire a license for the vGPU the VM was using, so after a while the NVIDIA driver degraded the performance of the vGPU until it became basically unusable.

<!-- gh-comment-id:2143338390 --> @jak4 commented on GitHub (Jun 1, 2024): I have resolved my issue. It had nothing todo with ollama, vllm, or any other part of the software stack. It was a LICENSING issue. I simply forgot to aquire a license for the vGPU the VM was using. So after a while the nvidia driver degraded the performance of the vGPU to become basically unusable.
Author
Owner

@mchiang0610 commented on GitHub (Jun 1, 2024):

@jak4 thank you for letting us know about this. May I ask what the VM provider was, so we know what to look out for in the future?

<!-- gh-comment-id:2143556917 --> @mchiang0610 commented on GitHub (Jun 1, 2024): @jak4 thank you for letting us know about this. May I ask what was the VM provider so we know in the future what to lookout for?
Author
Owner

@jak4 commented on GitHub (Jun 3, 2024):

@mchiang0610 I'm unsure what you mean by VM provider, but I'm running a homelab with Proxmox as the VM host and a Tesla P40. The guest is a Debian 12 instance.

<!-- gh-comment-id:2145525820 --> @jak4 commented on GitHub (Jun 3, 2024): @mchiang0610 I'm unsure what you mean bei VM provider, but I'm running a homelab with PROXMOX as the VM host and Tesla P40. The guest is a Debian 12 instance.
Author
Owner

@azeezabdikarim commented on GitHub (Jun 6, 2024):

I am having the same issue running:
ollama 0.1.41
M2 Max with 64 GB RAM

I was initially using the 'llava' model, which would hang after ~10 image and prompt pairs. Now I have switched to 'llava-llama3' and am able to process ~20 requests before it hangs.

<!-- gh-comment-id:2151791753 --> @azeezabdikarim commented on GitHub (Jun 6, 2024): I am having the same issue running: ollama 0.1.41 M2 Max with 64gb ram I was initially using the 'llava' model, which would hang after ~10 image and prompt pairs. Now I have switched to 'llava-llama3' and am able to process ~20 requests before it hangs.
Author
Owner

@Luzifer commented on GitHub (Jun 6, 2024):

With llama3:8b and dolphin-mistral:7b, v0.1.41 produces complete garbage after some prompts. Downgrading to v0.1.38 solved this for me: both models behave properly as before. (Both versions built through the Arch Linux build process.)


@dhiltgen commented on GitHub (Jun 6, 2024):

I haven't been able to reproduce this yet.

On an M3 Mac, the following loops at least 80+ times without problems:

```
C=0; while ~/ollama-darwin run llava --verbose please describe the contents of the following image ./image.jpg ; do C=$(($C+1)); echo $C; done
```

A CUDA Windows system also loops cleanly with this:

```
$C=0; while($true) { &ollama run --verbose llava please describe the contents of the following image ./image.jpg ; ++$C; write-output $C }
```

Perhaps there are some non-default settings being passed via API clients that are causing the hang? Can anyone share a minimal curl loop (see https://github.com/ollama/ollama/blob/main/docs/api.md#request-with-images) that reproduces it?
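
A minimal loop of the kind being requested could look like the following sketch, in Python rather than curl, against the documented /api/generate image endpoint on the default local port; the model, prompt, and image path are placeholder choices:

```python
# Minimal repro loop against /api/generate with an image, per the API docs
# linked above; llava, the prompt, and image.jpg are placeholder choices.
import base64
import requests

with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

i = 0
while True:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llava",
            "prompt": "please describe the contents of the following image",
            "images": [image_b64],
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    i += 1
    print(i, resp.json()["response"][:60])
```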


@Luzifer commented on GitHub (Jun 7, 2024):

It looks like it's much easier to break a model derived from llama3:8b with a longer system prompt (which I'm not able to share) than the plain llama3:8b, but eventually a chat (OpenWebUI) with llama3:8b also broke down: chat-Unseen Backyard Secret Revealed.txt (https://github.com/user-attachments/files/15725380/chat-Unseen.Backyard.Secret.Revealed.txt)

Just guessing: since the model with the longer system prompt breaks earlier, I'd say the bigger the amount of text, the earlier it breaks.

After that linked chat, even plain ollama run produces garbage:

Run output:

```
# ollama run llama3:8b "Please write a story of a girl wandering into the forest, discovering all the secrets of the forest. Include her feelings and throughts. Ignore length limits."
As she stepped out of a secret.

As Sophia had always felt a name her age of the woods behind her heart was a Whispering Woods' whispers of course I'd ever saw before - ancient trees towered withered clearing
surrounded by towering trees, their branches that most vividly colored leaves and wildflowers, there was a massive stone statue of twisted tree, its trunk twisted with
age-wrinkled with a small streams of the sun began to me felt like my feetprints of hidden treasures: shiny rocks that had made their in a cozy little den beneath the ancient
tree; a
, a hidden pools of crystal clear and filled with crystal-clear water crystal clear; a tiny fish swimming holes in old, abandoned cabin that seemed carved withered from the
perfect resting place for centuries.

As I could be enjoyed.
As I spent hours passed the stories like any old trees or creatures, though those secrets are home to the whispers. As you see things like, fireflies across the sky on clear
nights; heard the moon cast my hair-raised magic on the wings as they flit about the treetoak
The forest itself – the whispers that faint murmurs of leaves that seem to come from nowhere and everywhere at the same time. Some people might be back inched secret glade of
hidden trails that crisscross the forest, each one day, I've been, while others are overgrown and require a veil or camouflaged by mist. But as you follow the forest is sharing
its secrets with me, one trail at a time.
As I could gothic, there are also dangers lurking in the shadows – dark creatures that canny corners of the forest; packs, and packs of wild wolves prowling through the
fringes, yelping like embers. But even the forest magic and I've to respect their space, to both to trust my wits its depths.
As I' That's the stories  – so many creatures! From the forest sprites whooping owls who taught me about the interconnectedness; the mischievous vouches the importance of
quick-thinking; and the majestic deer, gentle giants us her wisdom with me. And the universe. They all these are you see, my friend, it'nt just waiting to me, like magic,
danger, and endless possibility. I feel so lucky to have stumbled upon its secrets that it able to explore its secrets and wherever I do so much deeper truths that part of
```

@Earnest-Williams commented on GitHub (Jun 27, 2024):

This happens frequently with 0.1.45 and dolphin-mixtral (the 26 GB version). Ollama is running from the command prompt on my XTX. It does not seem to happen with smaller models.


@hybra commented on GitHub (Jun 30, 2024):

I was having the same issue on a Mac. What I just found out is that if I run (0.1.48)

ollama serve &

from a terminal, then I run into the issue of the server crashing after 5-6 requests to /api/generate, and .ollama/logs/server.log is not created/populated,

while if I run

open /Applications/Ollama.app &

the log is created, the server works flawlessly (I left it running all night and we're now at 600 logged requests), and the Ollama icon appears in the macOS menu bar (where you can also quit it).

So there is definitely some difference between launching the bare ollama server and starting the app. I can't say precisely what causes the issue and blocks the logging, but at least it sounds like there's a workaround, and I find this behavior to be consistent.
Could someone on Linux and Windows check this too?


@jtoy commented on GitHub (Jun 30, 2024):

Honestly, from all my personal experience and from talking to lots of other developers: Ollama is really good for quick prototyping and testing models, but due to ongoing issues like this it is not really meant for production. For production, vLLM is much more stable and seems to be used by lots of companies.


@hybra commented on GitHub (Jun 30, 2024):

> Honestly, from all my personal experience and from talking to lots of other developers: Ollama is really good for quick prototyping and testing models, but due to ongoing issues like this it is not really meant for production. For production, vLLM is much more stable and seems to be used by lots of companies.

vLLM seems to be Linux-only. But we're off-topic now.


@emyasnikov commented on GitHub (Jul 11, 2024):

It seems to me that when using LLaVA, Ollama freezes completely after a couple of hundred requests. I can't find any errors or other indications of what could be going wrong.


@itinance commented on GitHub (Aug 15, 2024):

On my Mac M1, it has been running for hours, serving hundreds of requests. On my Hetzner server with an NVIDIA RTX 4000, it gets stuck after some requests.

Screenshot 2024-08-15 at 22 50 08 (https://github.com/user-attachments/assets/4718d949-2e00-4bc5-8f6c-45e9617d73e4)

@itinance commented on GitHub (Aug 15, 2024):

After 10 minutes or so, it starts to hang on Linux with an RTX 4000. On a Mac M1, it has been running fine for 24 hours.
https://github.com/ollama/ollama/issues/6380


@hhhhzl commented on GitHub (Sep 26, 2024):

Same issue here: 4x V100 32 GB GPUs on Linux, Ollama in Docker, llama2 70B. After some iterations it hangs for two hours, but all 4 GPUs stay active (each at about 30%) while it hangs. It generates smoothly for 100 iterations, then hangs, then generates some more, then hangs again...


@omani commented on GitHub (Sep 27, 2024):

It amazes me that this issue has been persisting for over 8 months now and nobody knows how to fix it.


@blubbsy commented on GitHub (Sep 27, 2024):

Yeah, I think the problem is that nobody has a clue why it is happening, or nobody has been able to debug it properly.

Do the Ollama alternatives have the same problem, or do vLLM or LocalAI work without problems for the same tasks? Because this would really push me to move to another system...


@omani commented on GitHub (Sep 28, 2024):

> Yeah, I think the problem is that nobody has a clue why it is happening, or nobody has been able to debug it properly.

I think the devs aren't aware of the fact that they are building an app that does not run properly. The commits show that this repo is active, but that doesn't do any good if the app is not working for many people.

And if they are aware, then it looks like the wrong prioritization of tasks. If I were a project manager or team lead or whatever, I would stop everything else and have my people fix this bug with top priority, because the happy path is obviously broken.


@blubbsy commented on GitHub (Sep 28, 2024):

> I think the devs aren't aware of the fact that they are building an app that does not run properly. The commits show that this repo is active, but that doesn't do any good if the app is not working for many people.

Well, it does run. The problem seems to only happen with the very large models; if you use a llama <7B you don't have these problems. At least for me it only happens with the large models (for my applications I do not need the big ones).


@maciejmajek commented on GitHub (Sep 29, 2024):

I wonder if that happens with llama.cpp too. Does anyone have any insights on that?


@itinance commented on GitHub (Sep 29, 2024):

> I wonder if that happens with llama.cpp too. Does anyone have any insights on that?

We use llama.cpp in the meantime, and there it works perfectly.


@jessegross commented on GitHub (Oct 9, 2024):

We may finally have a solution to this. For those that are experiencing the problem and are able to build from source, there is a new runner module that is currently being tested. Instructions for building it are here:
https://github.com/ollama/ollama/blob/main/docs/development.md#transition-to-go-runner


@jason-ni commented on GitHub (Oct 9, 2024):

> > I wonder if that happens with llama.cpp too. Does anyone have any insights on that?
>
> We use llama.cpp in the meantime, and there it works perfectly.

Yes, the llama.cpp server works well without a similar issue. However, I've been looking for tool-calling API support recently, and it seems the llama.cpp server is lagging behind, as related issues are still open.

And glad to see the Ollama team is addressing this issue. Well done! Thank you!


@blubbsy commented on GitHub (Oct 9, 2024):

> We may finally have a solution to this. For those that are experiencing the problem and are able to build from source, there is a new runner module that is currently being tested. Instructions for building it are here: https://github.com/ollama/ollama/blob/main/docs/development.md#transition-to-go-runner

Is there a timeline for when an official build will be provided?


@jessegross commented on GitHub (Oct 9, 2024):

> > We may finally have a solution to this. For those that are experiencing the problem and are able to build from source, there is a new runner module that is currently being tested. Instructions for building it are here: https://github.com/ollama/ollama/blob/main/docs/development.md#transition-to-go-runner
>
> Is there a timeline for when an official build will be provided?

We are phasing it in (opt-in, opt-out, etc.) to try to catch any surprises. The general goal is to have it broadly available by the end of the month if nothing major comes up. However, the more people who are able to test it, the faster we can build confidence.


@WeirdCarrotMonster commented on GitHub (Oct 11, 2024):

> We may finally have a solution to this. For those that are experiencing the problem and are able to build from source, there is a new runner module that is currently being tested. Instructions for building it are here: https://github.com/ollama/ollama/blob/main/docs/development.md#transition-to-go-runner

In my case it seems to have helped: I was able to leave Ollama running overnight, processing batches of embeddings via the /api/embed endpoint. It ran for a little under 8 hours with no freezes or error logs.
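
For anyone wanting to run a similar soak test, a sketch of batched embedding calls against /api/embed, whose "input" field accepts a list of strings; the model name and batch sizes are placeholders:

```python
# Sketch of a batched-embeddings soak test against /api/embed; all-minilm
# is a placeholder embedding model.
import requests

def embed_batch(texts, model="all-minilm"):
    resp = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": model, "input": texts},  # "input" may be a list
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["embeddings"]  # one vector per input string

for batch_no in range(1000):
    vectors = embed_batch([f"document {batch_no}-{i}" for i in range(32)])
    print(batch_no, len(vectors))
```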


@Shahin-rmz commented on GitHub (Oct 16, 2024):

Thanks for taking the problem into consideration. I mostly work with Google Colab, and I cannot keep Ollama running for long runs.
Happy to test the new feature if I can.


@dhiltgen commented on GitHub (Oct 23, 2024):

Please give the latest 0.4.0 RC release a try and let us know how it goes.

https://github.com/ollama/ollama/releases


@willypaz243 commented on GitHub (Nov 17, 2024):

I have a similar problem. I use a VS Code extension called "Continue - Codestral, Claude, and more" which uses Ollama as an LLM provider. I have configured the tabAutocomplete function, which queries Ollama to autocomplete code, but since version 0.4.0 Ollama gets stuck after some queries. Running ollama run codegemma:code, I saw that on some queries it keeps generating tokens without stopping, and I believe that is the cause of the hangup. It also prevents the model from being stopped with ollama stop codegemma:code: ollama ps shows "stopping ..." but it never finishes.


@dhiltgen commented on GitHub (Nov 18, 2024):

@willypaz243 you might be experiencing the same thing as #7645 - with OLLAMA_DEBUG=1 set, when it gets stuck we see periodic "context limit hit - shifting" messages in the logs, and the ollama_llama_server process saturates one CPU core.


@naffiq commented on GitHub (Nov 19, 2024):

> Please give the latest 0.4.0 RC release a try and let us know how it goes.
>
> https://github.com/ollama/ollama/releases

Unfortunately, I am also experiencing this issue with version 0.4.2.
MacBook Pro with Apple M2 Pro and 16 GB of unified memory


@jessegross commented on GitHub (Nov 21, 2024):

I think the original issue is fixed, but it sounds like there is a new issue with somewhat similar symptoms. I'm going to close this issue so we can track it in a single place in #7645. For those who are running into this, we have made further improvements in this area, so it would be helpful if you can test with 0.4.3-rc0 (or later) and report the results in the other bug.


@KalyanKumarAdepu commented on GitHub (May 12, 2025):

Hi, I am using an EC2 instance of type g5.4xlarge (with an A10 GPU). I installed Ollama and tried using the llama3.2:3b model. I have a DataFrame with 600 rows, and for each record, I need to call the LLM model 26 times sequentially. I tried running this in a loop, but after completing certain milestones like 10, 50, or 100 records, the LLM API stops responding — it literally gets stuck. How can I resolve this issue?
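
One way to keep a sequential loop like that from stalling the whole job, in line with the timeout-and-retry approach discussed later in this thread, is to bound each call and retry on failure. A sketch, where the function name, model, and timeout values are illustrative:

```python
# Sketch: bound each sequential LLM call with a timeout and retry so one
# stuck request can't stall the remaining rows; values are illustrative.
import requests

def call_llm(prompt, model="llama3.2:3b", attempts=3, per_try_timeout=120):
    last_err = None
    for _ in range(attempts):
        try:
            resp = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=per_try_timeout,  # seconds before giving up on this try
            )
            resp.raise_for_status()
            return resp.json()["response"]
        except requests.exceptions.RequestException as err:
            last_err = err  # timed out or failed; try again
    raise last_err
```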


@hossam1522 commented on GitHub (May 17, 2025):

Facing the same error with mistral:7b.


@voycey commented on GitHub (Jul 5, 2025):

This is still happening in July 2025 with unsloth/gemma3 models.


@bennyschmidt commented on GitHub (Jul 8, 2025):

Can confirm it still happens with a high volume of requests (it will eventually hang), but it seems like a normal concurrency issue that the end developer should deal with, not Ollama.

What I am doing is managing my own queue using a library (bee-queue) with Redis, enqueuing every request and working through the queue at a static interval, ensuring each Ollama request completes before the next one is sent in. No more issues with hanging.

There is apparently a way to accomplish this within Ollama (via OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE), and even to control these per model via the Modelfile, but I haven't been able to get the built-in queue to work at scale.

Edit: OLLAMA_NUM_PARALLEL works as intended, but there are 2 layers to scaling for a high volume of requests:

  1. Scaling LLM requests to the max CPU load (likely a small number on your personal machine). This is the flow of requests from your app to Ollama.

  2. Scaling your app's network requests to handle all incoming traffic (however concurrent it may be). This is the flow of requests from end users, through your app, to Ollama.

That's why, even though Ollama has an internal queue, your app that uses Ollama likely still needs one (a sketch follows below).
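
A minimal sketch of that client-side serialization, using Python's built-in queue and a single worker thread in place of bee-queue/Redis; the model, endpoint, and timeout are illustrative:

```python
# Sketch: serialize Ollama traffic through one worker so only a single
# generation is ever in flight; bee-queue/Redis would play this role in
# a real deployment.
import queue
import threading
import requests

jobs = queue.Queue()

def worker():
    while True:
        prompt, callback = jobs.get()
        try:
            resp = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": "llama3:8b", "prompt": prompt, "stream": False},
                timeout=300,  # bound each job so one hang can't stall the queue
            )
            resp.raise_for_status()
            callback(resp.json()["response"])
        except requests.exceptions.RequestException:
            callback(None)  # timed out or failed; the caller decides what to do
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
jobs.put(("Why is the sky blue?", print))
jobs.join()
```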


@voycey commented on GitHub (Jul 8, 2025):

We have tried with our own queue; it's the "ensuring each Ollama request completes" part that trips us up, because when it hangs the task never completes, and there is no natural TTL on the request.


@bennyschmidt commented on GitHub (Jul 8, 2025):

> when it hangs the task never completes

You can still handle timeouts in your app, though. Just an opinion, but I think developers should manage the abstraction of handling a high volume of requests, not Ollama. It's just a wrapper library for LLMs. If you have an API that handles a high volume of requests, yes, even requests that can time out with no response, you should just manage that in your application.


@voycey commented on GitHub (Jul 8, 2025):

But the point is: how do you manage a timeout when there is no natural TTL to set? Sure, I could say "no request should go on longer than 20 minutes", but that's horribly inefficient, and some requests can easily go on for 10-15 minutes. No API would hold a connection that long, yet slower responses on locally hosted LLMs can understandably take that long.

vLLM doesn't have this issue. This sounds like a workaround that is required because Ollama doesn't handle it correctly; otherwise I would need to do the same in vLLM.
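
One middle ground between no TTL and a blunt 20-minute cap is an inter-token timeout: stream the response and cap the gap between chunks rather than the total generation time, so long-but-healthy generations still finish. A sketch, assuming the streaming form of /api/generate; the 60-second stall threshold is arbitrary:

```python
# Sketch: detect a stalled generation via a per-chunk read timeout while
# streaming; requests applies the read timeout to each socket read.
import json
import requests

def generate_with_stall_detection(prompt, model="llama3:8b", stall_seconds=60):
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=(5, stall_seconds),  # (connect timeout, per-read timeout)
    ) as resp:
        resp.raise_for_status()
        parts = []
        # iter_lines raises a timeout error if no data arrives for stall_seconds.
        for line in resp.iter_lines():
            if line:
                chunk = json.loads(line)
                parts.append(chunk.get("response", ""))
                if chunk.get("done"):
                    break
        return "".join(parts)
```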


@bennyschmidt commented on GitHub (Jul 8, 2025):

> No API would hold a connection that long

Precisely the point. With this approach, all your API endpoint does is enqueue LLM requests. Your API endpoint does not hang around for the lifecycle of the LLM request.

For such a task, you need a queue.

Beyond that point (and closer to the issue), the problem isn't really long-running APIs anyway, but the fact that Ollama can't handle many thousands or millions of concurrent requests. You have to enqueue those in your application.

Reference: github-starred/ollama#47576