[GH-ISSUE #1458] Ollama hung after 30 minutes of use #26543

Closed
opened 2026-04-22 02:52:44 -05:00 by GiteaMirror · 24 comments

Originally created by @lfoppiano on GitHub (Dec 11, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1458

I'm running Ollama on my Mac M1 and I'm trying to use the 7B models for processing batches of questions and answers.
I noticed that after a while Ollama just hangs and the process stays there forever.

Is there a way to know what's going on?

I did not find a way to get to the logs.

Thank you in advance


@igorschlum commented on GitHub (Dec 11, 2023):

Hi @lfoppiano
How much memory do you have on your Mac M1, and which 7B model are you using?

This command displays Ollama's log on a Mac:

```
cat ~/.ollama/logs/server.log
```


@lfoppiano commented on GitHub (Dec 11, 2023):

Thanks. I have 32GB of memory.


@igorschlum commented on GitHub (Dec 11, 2023):

I also have a 32GB Mac M1. Can you provide a sample script so we can test it on our side and reproduce the issue?
Which LLM are you using, llama2:7b?


@salbahra commented on GitHub (Dec 18, 2023):

I am also having this issue. Ollama works great for small batches and single messages; however, with a very large batch (running more than 30 minutes) it eventually stalls. I have to quit Ollama and restart it for it to resume functioning properly.

I am using an M3 128GB MacBook and the model I'm using is Mixtral.


@modeseven commented on GitHub (Dec 25, 2023):

Same issue for me. I'm running Ollama in a Docker image; it works great on a fresh startup, but after a few prompts or about 30 minutes it becomes unresponsive and I have to restart my container. It doesn't matter which model I'm running.


@technovangelist commented on GitHub (Dec 26, 2023):

Hi everyone, thanks for submitting this issue. Which models are being used here? The original poster said it was a 7-billion-parameter model, so I would love to get an example and try to reproduce. Then someone mentioned Mixtral, which is closer to 50 billion parameters. It shouldn't happen anywhere; I just want to start looking in the right place.


@technovangelist commented on GitHub (Dec 26, 2023):

I tried running a series of about 200 questions through Mixtral. This took about 40 minutes, and memory used hovered around 40GB. I had to run through a lot more questions on a 7B model to get over 40 minutes, but it also ran as expected. So can I get a bit more info from @modeseven, @salbahra, and @lfoppiano?

How are you running Ollama? What platform (Mac, Linux, or WSL2)? On Docker or not? How much RAM? What video card and how much VRAM? What model are you using? This information will help us track down where the issue might be.
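
A minimal soak test along these lines (a sketch, assuming the public `/api/generate` REST endpoint and a placeholder prompt list) looks like:

```python
# Sketch of a soak test: push a long series of prompts through Ollama's
# /api/generate endpoint and log per-request timing, so a hang shows up
# as a request that never returns. The prompts are placeholders.
import time

import requests

for i in range(200):
    start = time.time()
    resp = requests.post(
        "http://127.0.0.1:11434/api/generate",
        json={"model": "mixtral", "prompt": f"Question {i}: why is the sky blue?", "stream": False},
        timeout=600,  # fail loudly instead of waiting forever on a stall
    )
    resp.raise_for_status()
    elapsed = time.time() - start
    print(f"run {i}: {elapsed:.1f}s, {len(resp.json()['response'])} chars")
```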


@salbahra commented on GitHub (Dec 26, 2023):

@technovangelist Thank you so much for looking into this! For me, I am using Mixtral instruct 8-bit and fp16 on an M3 with 128GB RAM. My code is roughly as follows:

```
import json
import time
import threading
from typing import Optional

from langchain_experimental.llms.ollama_functions import OllamaFunctions
from langchain.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.chains.openai_functions.utils import _convert_schema, _resolve_schema_references, get_llm_kwargs

llm = OllamaFunctions(model="jmorgan/mixtral", temperature=0.1)

def debug_print(*args):
    # Logging helper; not shown in the original snippet, assumed to print.
    print(*args)

class FunctionThread(threading.Thread):
    # Runs a callable on a background thread, capturing its result or exception.
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args
        self.result = None
        self.exception = None

    def run(self):
        try:
            self.result = self.func(*self.args)
        except Exception as e:
            self.exception = e

def call_llm_with_timeout(func, args, timeout=10 * 60):
    thread = FunctionThread(func, args)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        raise TimeoutError("Function call exceeded time limit")
    if thread.exception:
        raise thread.exception
    return thread.result

def call_llm(template, reply_shape, arguments):
    """Calls the language model with the provided template and arguments."""

    openai_schema = reply_shape.schema()
    openai_schema = _resolve_schema_references(
        openai_schema, openai_schema.get("definitions", {})
    )

    function = {
        "name": "information_extraction",
        "description": "Extracts the relevant information from the pathology report.",
        "parameters": {
            "type": "object",
            "properties": {
                "info": _convert_schema(openai_schema)
            },
            "required": ["info"],
        },
    }

    prompt = PromptTemplate(
        template=template,
        input_variables=["report"]
    )
    input = prompt.format_prompt(**arguments)

    debug_print(f"Calling LLM with input: {input.to_string()}")

    retries = 3
    for attempt in range(retries):
        try:
            output = call_llm_with_timeout(
                lambda: llm.bind(**get_llm_kwargs(function)).invoke(input.to_string()),
                args=()
            )
            debug_print('Unprocessed output:', output.additional_kwargs['function_call'])
            output = json.loads(output.additional_kwargs['function_call']['arguments'])['info']
            debug_print('Processed output:', output)
            return output
        except Exception as e:
            debug_print(f'Error parsing output: {e}')
            if attempt < retries - 1:
                debug_print(f'Retrying... Attempt {attempt + 2}/{retries}')
                time.sleep(1)  # Wait for 1 second before retrying
            else:
                debug_print('Retries exhausted. Returning None.')
                return None
```
I added the timeout logic as a first attempt to get around this issue (but also to ensure the function-call return conforms to the expected shape; I didn't find a way to do this via LangChain). I then run a for loop that executes many `call_llm` calls, and it sometimes runs for 2-3 hours before it stops responding. Five different prompts/inputs are used throughout each for-loop iteration. If sharing more code would help, please let me know.

Thank you!


@tubnt commented on GitHub (Dec 29, 2023):

M3 + 64GB (macOS)
i7 + 64GB + 4090 (Windows, Docker, WSL2)

I see this problem on both systems. When generating long novels, it usually gets stuck after about three prompts in the same conversation, without any error; the output just stops.


@technovangelist commented on GitHub (Jan 3, 2024):

We made some updates to the models pretty recently. Can you try repulling the models (`ollama pull mixtral`) to ensure you have the latest? @lfoppiano @modeseven @tubnt

And @salbahra, I see you are using Jeff's first version of Mixtral. Can you try switching over to the library version at https://ollama.ai/library/mixtral?


@cloudnativeengineer commented on GitHub (Jan 9, 2024):

I think I found something similar. Using `(version HEAD-6164f37)` with the command `for instance in $(seq 1 17); do ollama run nous-hermes2:10.7b-solar-q4_K_M Hello; done`, `ollama serve` stops generating text on the 17th run and won't process requests normally until `ollama serve` is restarted.


@salbahra commented on GitHub (Jan 9, 2024):

@cloudnativeengineer I am also seeing issues with Nous Hermes. In fact, I can consistently reproduce it with `nous-hermes:70b-llama2-q6_K` after just a few runs. One thing to note: when I quit Ollama from the status bar, it kept running (as shown in Activity Monitor); after force-quitting it and relaunching, things worked again.

@technovangelist After pulling the latest Mixtral model, I cannot reproduce the hanging with that model. This is making me think it is model specific and right now the Nous Hermes is the most affected in my testing.


@igorschlum commented on GitHub (Jan 9, 2024):

OK, I will get back to the problem. I will run @technovangelist's script on all models, and I will try to modify the script to run 1000 calls for each API. It would be nice to have a memory status at the end of each 1000 API calls. We will see if it's a problem with Ollama or a problem with running some models.

@jmorganca Do you run those kind of tests?


@jmorganca commented on GitHub (Feb 20, 2024):

We do – this is an issue that's been fixed as of the last release. I'll close this for now, but please do report back if it's still happening.


@kuccello commented on GitHub (Mar 3, 2024):

Encountering the issue described above:
I have downloaded the latest version of Ollama (as of March 3, 2024) and am trying to run [nomic-embed-text](https://registry.ollama.ai/library/nomic-embed-text) with about 500 words. It works on the first two prompts for embeddings and then the process hangs. I'm running on an M1 Ultra (Mac Studio, first edition, 128GB RAM), calling the API via node-fetch.

```ts
// Assumed response shape for /api/embeddings (not shown in the original snippet).
interface EmbeddingsResponse {
  embedding: number[];
}

async function getEmbeddings(prompt: string): Promise<EmbeddingsResponse> {
  const response = await fetch('http://127.0.0.1:11434/api/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      // model: 'nomic-embed-text',
      model: 'all-minilm',
      prompt
    })
  });

  if (!response.ok) {
    throw new Error(`HTTP error! status: ${response.status}`);
  }
  // TODO validate response data
  return response.json() as Promise<EmbeddingsResponse>;
}
```

Output from the process running `ollama serve`:

```
❯ ollama serve
time=2024-03-03T12:54:37.029-08:00 level=INFO source=images.go:710 msg="total blobs: 29"
time=2024-03-03T12:54:37.031-08:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-03T12:54:37.032-08:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-03-03T12:54:37.032-08:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-03T12:54:37.049-08:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [metal]"
loading library /var/folders/p0/03k5z_zj1jx511bsflprb9mr0000gn/T/ollama1188978798/metal/libext_server.dylib
time=2024-03-03T12:54:52.976-08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /var/folders/p0/03k5z_zj1jx511bsflprb9mr0000gn/T/ollama1188978798/metal/libext_server.dylib"
time=2024-03-03T12:54:52.976-08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 23 key-value pairs and 101 tensors from /Users/kuccello/.ollama/models/blobs/sha256:797b70c4edf85907fe0a49eb85811256f65fa0f7bf52166b147fd16be2be4662 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = all-MiniLM-L6-v2
llama_model_loader: - kv   2:                           bert.block_count u32              = 6
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 384
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 1536
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:   63 tensors
llama_model_loader: - type  f16:   38 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 384
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 6
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 32
llm_load_print_meta: n_embd_head_v    = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 384
llm_load_print_meta: n_embd_v_gqa     = 384
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 1536
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 22M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 22.57 M
llm_load_print_meta: model size       = 43.10 MiB (16.02 BPW) 
llm_load_print_meta: general.name     = all-MiniLM-L6-v2
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_tensors: ggml ctx size =    0.08 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =    20.38 MiB, (   20.44 / 98304.00)
llm_load_tensors: offloading 6 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 7/7 layers to GPU
llm_load_tensors:        CPU buffer size =    22.73 MiB
llm_load_tensors:      Metal buffer size =    20.37 MiB
...............................
llama_new_context_with_model: n_ctx      = 256
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Ultra
ggml_metal_init: picking default device: Apple M1 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = /var/folders/p0/03k5z_zj1jx511bsflprb9mr0000gn/T/ollama1188978798
ggml_metal_init: loading '/var/folders/p0/03k5z_zj1jx511bsflprb9mr0000gn/T/ollama1188978798/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 103079.22 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     2.25 MiB, (   24.50 / 98304.00)
llama_kv_cache_init:      Metal KV buffer size =     2.25 MiB
llama_new_context_with_model: KV self size  =    2.25 MiB, K (f16):    1.12 MiB, V (f16):    1.12 MiB
llama_new_context_with_model:        CPU input buffer size   =     2.26 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     4.27 MiB, (   28.77 / 98304.00)
llama_new_context_with_model:      Metal compute buffer size =     4.25 MiB
llama_new_context_with_model:        CPU compute buffer size =     0.75 MiB
llama_new_context_with_model: graph splits (measure): 2
time=2024-03-03T12:54:53.051-08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
[GIN] 2024/03/03 - 12:54:53 | 200 |  328.300208ms |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/03/03 - 12:54:53 | 200 |     473.209µs |       127.0.0.1 | POST     "/api/embeddings"
[1]    57562 killed     ollama serve
```


@smarti57 commented on GitHub (Apr 17, 2024):

I'm running into this myself with `mistral:7b-instruct-v0.2-q6_K` on Ollama 0.1.31. I've got a repeating process that does 8 interactions on an input file and then moves on to the next file. I'm actually using the 7B to test on a Linux workstation (128GB RAM) with an NVIDIA GPU (16GB VRAM), but have also been running on an M2 Max 64GB laptop to use a larger model. Looking at nvtop on the workstation, the ollama process is using roughly 6.5GB of VRAM with no creep. Both seem to stall with high GPU/CPU utilization after the 40th or so interaction.

At the same time, I've left Ollama up for weeks on end backing a Big-AGI instance on the same machine and never had any issues. Could there be something going on with either the rapid injection of chats or the model context creeping up? I know there was a ticket open for resetting context state to an initial state; that's why I'm wondering. At this point I'm considering restarting Ollama and taking the model-reload hit every 4 runs or so. That would solve it, but it would be a real waste of time.

Looking over the log, when it hangs it seems to just be spitting this out over and over (thousands of entries):

```
Apr 17 19:29:16 ai ollama[1781]: {"function":"update_slots","level":"INFO","line":1597,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slo>
```

An identical entry is written to the log every 13 seconds. Prior to that, I get the kind of normal, expected log entries per interaction:

```
Apr 17 18:57:47 ai ollama[1781]: [GIN] 2024/04/17 - 18:57:47 | 200 | 5.255910821s | 192.168.1.94 | POST "/api/chat"
Apr 17 18:57:47 ai ollama[1781]: {"function":"launch_slot_with_data","level":"INFO","line":826,"msg":"slot is processing task","slot_id":0,"task_id":6886,"tid":"139863497086528","timestamp":1713380267}
Apr 17 18:57:47 ai ollama[1781]: {"function":"update_slots","ga_i":0,"level":"INFO","line":1805,"msg":"slot progression","n_past":1,"n_past_se":0,"n_prompt_tokens_processed":1822,"slot_id":0,"task_id":6886,"tid":"139863497086528">
Apr 17 18:57:47 ai ollama[1781]: {"function":"update_slots","level":"INFO","line":1832,"msg":"kv cache rm [p0, end)","p0":1,"slot_id":0,"task_id":6886,"tid":"139863497086528","timestamp":1713380267}
Apr 17 18:57:51 ai ollama[1781]: {"function":"update_slots","level":"INFO","line":1597,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slo>
Apr 17 18:58:03 ai ollama[1781]: {"function":"print_timings","level":"INFO","line":265,"msg":"prompt eval time = 829.73 ms / 1822 tokens ( 0.46 ms per token, 2195.91 tokens per second)","n_prompt_tokens_processed":18>
Apr 17 18:58:03 ai ollama[1781]: {"function":"print_timings","level":"INFO","line":279,"msg":"generation eval time = 15314.43 ms / 1143 runs ( 13.40 ms per token, 74.64 tokens per second)","n_decoded":1143,"n_tokens_sec>
Apr 17 18:58:03 ai ollama[1781]: {"function":"print_timings","level":"INFO","line":289,"msg":" total time = 16144.16 ms","slot_id":0,"t_prompt_processing":829.725,"t_token_generation":15314.43,"t_total":16144.155,"task>
Apr 17 18:58:03 ai ollama[1781]: {"function":"update_slots","level":"INFO","line":1636,"msg":"slot released","n_cache_tokens":1943,"n_ctx":2048,"n_past":1942,"n_system_tokens":0,"slot_id":0,"task_id":6886,"tid":"139863497086528",>
Apr 17 18:58:03 ai ollama[1781]: [GIN] 2024/04/17 - 18:58:03 | 200 | 16.158508055s | 192.168.1.94 | POST "/api/chat"
```


@smarti57 commented on GitHub (Apr 17, 2024):

Just grabbed 0.1.32... this MAY have been fixed; it's progressing MUCH further than it did previously. It ran longer than before (I got 24 files done before it stalled), but it still stalled out after 17 minutes of continuous use over 24x8 interactions (whatever that works out to).


@smarti57 commented on GitHub (Apr 18, 2024):

Another update: this seems to be related to Mistral. I ran gemma:2b just to test quickly, and it completed all 100 test input files. I'll try a 7B and a 13B model and post updates.


@smarti57 commented on GitHub (Apr 23, 2024):

Final update: I was able to mitigate this in my workflow by swapping out model files. Unloading and reloading a different model after each group of prompts (no more than 8) eliminated the lock-up; see the sketch below. It's a bug for sure, still present in 0.1.32, and certain models (mistral and llama3, I'm looking at you) seem to exhibit it faster. Upping the ctx to 4096 helped a bit but didn't eliminate the issue.
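
A minimal sketch of that rotation, assuming the public `/api/generate` REST endpoint; the model tags, group size, and prompt list are illustrative:

```python
# Sketch of the unload/reload mitigation: alternate between two model tags
# so each group of prompts starts from a freshly loaded model. Loading the
# next model evicts the previous one on a single-GPU or Metal setup.
import itertools

import requests

MODELS = itertools.cycle(["mistral:7b-instruct-v0.2-q6_K", "gemma:2b"])  # illustrative pair

def ask(model: str, prompt: str) -> str:
    resp = requests.post(
        "http://127.0.0.1:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [f"prompt {i}" for i in range(32)]  # placeholder workload
for start in range(0, len(prompts), 8):       # groups of at most 8
    model = next(MODELS)                      # swap model between groups
    for p in prompts[start : start + 8]:
        ask(model, p)
```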


@cloudnativeengineer commented on GitHub (Apr 24, 2024):

@smarti57 thanks for sharing that unload trick! I hadn't heard of it, or even considered it a possibility.


@entmike commented on GitHub (Apr 25, 2024):

> Another update: this seems to be related to Mistral. I ran gemma:2b just to test quickly, and it completed all 100 test input files. I'll try a 7B and a 13B model and post updates.

It's also happening with llava. I'll try your unload trick.


@reski-rukmantiyo commented on GitHub (May 5, 2024):

> I was able to mitigate this in my workflow by swapping out model files. Unloading and reloading a different model after each group of prompts (no more than 8) eliminated the lock-up.

Hi @smarti57, I'm trying to solve this by upping the ctx. Could you share how? This is still happening on 0.1.33.
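
For reference, the context size can be raised per request through the documented `num_ctx` option, or persistently in a Modelfile with `PARAMETER num_ctx 4096`. A minimal sketch, with an illustrative model tag and prompt:

```python
# Sketch: raise the context window per request via the documented num_ctx
# option (the default at the time was 2048). Model tag is illustrative.
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "mistral:7b-instruct-v0.2-q6_K",
        "prompt": "Summarize the report...",
        "stream": False,
        "options": {"num_ctx": 4096},
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```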


@urlan commented on GitHub (Aug 30, 2025):

Well... it's still happening nowadays. I had to wrap `ollama.chat` in `concurrent.futures` to implement a timeout; see the sketch below.
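
A minimal sketch of that wrapper, assuming the official `ollama` Python client; note that `concurrent.futures` cannot actually kill the blocked call, so a hung worker thread is leaked:

```python
# Sketch of a timeout wrapper around ollama.chat using concurrent.futures.
# On timeout the worker thread keeps running (Python cannot kill a blocked
# thread); the pool is shut down without waiting so the caller can move on.
from concurrent.futures import ThreadPoolExecutor

import ollama

def chat_with_timeout(model: str, messages: list, timeout: float = 120.0):
    """Run ollama.chat in a worker thread; raise TimeoutError if it stalls."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(ollama.chat, model=model, messages=messages)
    try:
        return future.result(timeout=timeout)  # raises TimeoutError on stall
    finally:
        pool.shutdown(wait=False)  # don't block on a hung worker thread

reply = chat_with_timeout("llama3", [{"role": "user", "content": "Hello"}])
print(reply["message"]["content"])
```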


@omerts commented on GitHub (Apr 21, 2026):

We are also experiencing hangs/stalls using qwen 3.5 / qwen 3.6 / qwen3-coder on a Mac Mini M4 Pro with 48GB RAM. We run scans on large texts, and it sometimes just stalls in the middle and never recovers.
We basically have a large text, and we start separate conversations, asking a single question about that text in each conversation. The differences between conversations are minor, yet it can still get stuck after 6 out of 9 questions. In other words, it is something the models/Ollama were already able to answer, most of the prompt is already in the KV cache, and it still just stalls.
