[GH-ISSUE #1458] Ollama hung after 30 minutes of use #26543

Closed
opened 2026-04-22 02:52:44 -05:00 by GiteaMirror · 24 comments

Originally created by @lfoppiano on GitHub (Dec 11, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1458

I'm running Ollama on my Mac M1 and I'm trying to use the 7B models for processing batches of questions and answers.
I noticed that after a while Ollama just hangs and the process stays there forever.

Is there a way to know what's going on?

I did not find a way to get to the logs.

Thank you in advance


@igorschlum commented on GitHub (Dec 11, 2023):

Hi @lfoppiano
How much memory do you have on your Mac M1, and which 7B model are you using?

This command displays Ollama's log on a Mac:

```
cat ~/.ollama/logs/server.log
```


@lfoppiano commented on GitHub (Dec 11, 2023):

Thanks. I have 32GB of memory.


@igorschlum commented on GitHub (Dec 11, 2023):

I also have a 32GB Mac M1. Can you provide a sample script so we can test it on our side and reproduce the issue?
Which LLM are you using, llama2:7b?


@salbahra commented on GitHub (Dec 18, 2023):

I am also having this issue. Ollama works great for small batches and single messages; however, with a very large batch (running more than 30 minutes) it eventually stalls. I have to quit Ollama and restart it for it to resume functioning properly.

I am using an M3 128GB MacBook and the model I'm using is Mixtral.


@modeseven commented on GitHub (Dec 25, 2023):

Same issue for me. I'm running Ollama in a Docker image; it works great on a fresh startup, but after a few prompts or about 30 minutes it becomes unresponsive and I have to restart my container. It doesn't matter which model I'm running.


@technovangelist commented on GitHub (Dec 26, 2023):

Hi everyone, thanks for submitting this issue. Which models are being used here? The original poster said it was a 7-billion-parameter model, so I would love to get an example and try to reproduce. Then someone mentioned Mixtral, which is closer to 50 billion parameters. It shouldn't happen anywhere; I just want to start looking in the right place.


@technovangelist commented on GitHub (Dec 26, 2023):

I tried running a series of about 200 questions through Mixtral. This took about 40 minutes, and memory used hovered around 40GB. I had to run through a lot more questions on a 7B model to get over 40 minutes, but it also ran as expected. So can I get a bit more info from @modeseven, @salbahra, and @lfoppiano?

How are you running Ollama? What platform (Mac, Linux, or WSL2)? On Docker or not? How much RAM? What video card and how much VRAM? What model are you using? This information will help us track down where the issue might be.
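
A minimal soak test along these lines (a sketch, assuming the public `/api/generate` REST endpoint and a placeholder prompt list) looks like:

```python
# Sketch of a soak test: push a long series of prompts through Ollama's
# /api/generate endpoint and log per-request timing, so a hang shows up
# as a request that never returns. The prompts are placeholders.
import time

import requests

for i in range(200):
    start = time.time()
    resp = requests.post(
        "http://127.0.0.1:11434/api/generate",
        json={"model": "mixtral", "prompt": f"Question {i}: why is the sky blue?", "stream": False},
        timeout=600,  # fail loudly instead of waiting forever on a stall
    )
    resp.raise_for_status()
    elapsed = time.time() - start
    print(f"run {i}: {elapsed:.1f}s, {len(resp.json()['response'])} chars")
```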


@salbahra commented on GitHub (Dec 26, 2023):

@technovangelist Thank you so much for looking into this! For me, I am using Mixtral instruct 8-bit and fp16 on an M3 with 128GB RAM. My code is roughly as follows:

```
import json
import time
import threading
from typing import Optional

from langchain_experimental.llms.ollama_functions import OllamaFunctions
from langchain.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.chains.openai_functions.utils import _convert_schema, _resolve_schema_references, get_llm_kwargs

llm = OllamaFunctions(model="jmorgan/mixtral", temperature=0.1)

def debug_print(*args):
    # Logging helper; not shown in the original snippet, assumed to print.
    print(*args)

class FunctionThread(threading.Thread):
    # Runs a callable on a background thread, capturing its result or exception.
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args
        self.result = None
        self.exception = None

    def run(self):
        try:
            self.result = self.func(*self.args)
        except Exception as e:
            self.exception = e

def call_llm_with_timeout(func, args, timeout=10 * 60):
    thread = FunctionThread(func, args)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        raise TimeoutError("Function call exceeded time limit")
    if thread.exception:
        raise thread.exception
    return thread.result

def call_llm(template, reply_shape, arguments):
    """Calls the language model with the provided template and arguments."""

    openai_schema = reply_shape.schema()
    openai_schema = _resolve_schema_references(
        openai_schema, openai_schema.get("definitions", {})
    )

    function = {
        "name": "information_extraction",
        "description": "Extracts the relevant information from the pathology report.",
        "parameters": {
            "type": "object",
            "properties": {
                "info": _convert_schema(openai_schema)
            },
            "required": ["info"],
        },
    }

    prompt = PromptTemplate(
        template=template,
        input_variables=["report"]
    )
    input = prompt.format_prompt(**arguments)

    debug_print(f"Calling LLM with input: {input.to_string()}")

    retries = 3
    for attempt in range(retries):
        try:
            output = call_llm_with_timeout(
                lambda: llm.bind(**get_llm_kwargs(function)).invoke(input.to_string()),
                args=()
            )
            debug_print('Unprocessed output:', output.additional_kwargs['function_call'])
            output = json.loads(output.additional_kwargs['function_call']['arguments'])['info']
            debug_print('Processed output:', output)
            return output
        except Exception as e:
            debug_print(f'Error parsing output: {e}')
            if attempt < retries - 1:
                debug_print(f'Retrying... Attempt {attempt + 2}/{retries}')
                time.sleep(1)  # Wait for 1 second before retrying
            else:
                debug_print('Retries exhausted. Returning None.')
                return None
```
I added the timeout logic as a first attempt to get around this issue (but also to ensure the function-call return conforms to the expected shape; I didn't find a way to do this via LangChain). I then run a for loop that executes many `call_llm` calls, and it sometimes runs for 2-3 hours before it stops responding. Five different prompts/inputs are used throughout each for-loop iteration. If sharing more code would help, please let me know.

Thank you!


@tubnt commented on GitHub (Dec 29, 2023):

M3 + 64GB (macOS)
i7 + 64GB + 4090 (Windows, Docker, WSL2)

I see this problem on both systems. When generating long novels, it usually gets stuck after about three prompts in the same conversation, without any error; the output just stops.


@technovangelist commented on GitHub (Jan 3, 2024):

We made some updates to the models pretty recently. Can you try repulling the models (`ollama pull mixtral`) to ensure you have the latest? @lfoppiano @modeseven @tubnt

And @salbahra, I see you are using Jeff's first version of Mixtral. Can you try switching over to the library version at https://ollama.ai/library/mixtral?


@cloudnativeengineer commented on GitHub (Jan 9, 2024):

I think I found something similar. Using `(version HEAD-6164f37)` with the command `for instance in $(seq 1 17); do ollama run nous-hermes2:10.7b-solar-q4_K_M Hello; done`, `ollama serve` stops generating text on the 17th run and won't process requests normally until `ollama serve` is restarted.


@salbahra commented on GitHub (Jan 9, 2024):

@cloudnativeengineer I am also seeing issues with Nous Hermes. In fact, I can consistently reproduce it with `nous-hermes:70b-llama2-q6_K` after just a few runs. One thing to note: when I quit Ollama from the status bar, it kept running (as shown in Activity Monitor); after force-quitting it and relaunching, things worked again.

@technovangelist After pulling the latest Mixtral model, I cannot reproduce the hanging with that model. This is making me think it is model specific and right now the Nous Hermes is the most affected in my testing.


@igorschlum commented on GitHub (Jan 9, 2024):

OK, I will get back to the problem. I will run @technovangelist's script on all models, and I will try to modify the script to run 1000 calls for each API. It would be nice to have a memory status at the end of each 1000 API calls. We will see if it's a problem with Ollama or a problem with running some models.

@jmorganca Do you run those kind of tests?


@jmorganca commented on GitHub (Feb 20, 2024):

We do – this is an issue that's been fixed as of the last release. I'll close this for now, but please do report back if it's still happening.


@kuccello commented on GitHub (Mar 3, 2024):

Encountering the issue described above:
I have downloaded the latest version of Ollama (as of March 3, 2024) and am trying to run [nomic-embed-text](https://registry.ollama.ai/library/nomic-embed-text) with about 500 words. It works on the first two prompts for embeddings and then the process hangs. I'm running on an M1 Ultra (Mac Studio, first edition, 128GB RAM), calling the API via node-fetch.

```ts
// Assumed response shape for /api/embeddings (not shown in the original snippet).
interface EmbeddingsResponse {
  embedding: number[];
}

async function getEmbeddings(prompt: string): Promise<EmbeddingsResponse> {
  const response = await fetch('http://127.0.0.1:11434/api/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      // model: 'nomic-embed-text',
      model: 'all-minilm',
      prompt
    })
  });

  if (!response.ok) {
    throw new Error(`HTTP error! status: ${response.status}`);
  }
  // TODO validate response data
  return response.json() as Promise<EmbeddingsResponse>;
}
```

Output from the process running `ollama serve`:

```
❯ ollama serve
time=2024-03-03T12:54:37.029-08:00 level=INFO source=images.go:710 msg="total blobs: 29"
time=2024-03-03T12:54:37.031-08:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-03T12:54:37.032-08:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-03-03T12:54:37.032-08:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-03T12:54:37.049-08:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [metal]"
loading library /var/folders/p0/03k5z_zj1jx511bsflprb9mr0000gn/T/ollama1188978798/metal/libext_server.dylib
time=2024-03-03T12:54:52.976-08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /var/folders/p0/03k5z_zj1jx511bsflprb9mr0000gn/T/ollama1188978798/metal/libext_server.dylib"
time=2024-03-03T12:54:52.976-08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 23 key-value pairs and 101 tensors from /Users/kuccello/.ollama/models/blobs/sha256:797b70c4edf85907fe0a49eb85811256f65fa0f7bf52166b147fd16be2be4662 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = all-MiniLM-L6-v2
llama_model_loader: - kv   2:                           bert.block_count u32              = 6
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 384
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 1536
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:   63 tensors
llama_model_loader: - type  f16:   38 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 384
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 6
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 32
llm_load_print_meta: n_embd_head_v    = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 384
llm_load_print_meta: n_embd_v_gqa     = 384
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 1536
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 22M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 22.57 M
llm_load_print_meta: model size       = 43.10 MiB (16.02 BPW) 
llm_load_print_meta: general.name     = all-MiniLM-L6-v2
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_tensors: ggml ctx size =    0.08 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =    20.38 MiB, (   20.44 / 98304.00)
llm_load_tensors: offloading 6 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 7/7 layers to GPU
llm_load_tensors:        CPU buffer size =    22.73 MiB
llm_load_tensors:      Metal buffer size =    20.37 MiB
...............................
llama_new_context_with_model: n_ctx      = 256
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Ultra
ggml_metal_init: picking default device: Apple M1 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = /var/folders/p0/03k5z_zj1jx511bsflprb9mr0000gn/T/ollama1188978798
ggml_metal_init: loading '/var/folders/p0/03k5z_zj1jx511bsflprb9mr0000gn/T/ollama1188978798/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 103079.22 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     2.25 MiB, (   24.50 / 98304.00)
llama_kv_cache_init:      Metal KV buffer size =     2.25 MiB
llama_new_context_with_model: KV self size  =    2.25 MiB, K (f16):    1.12 MiB, V (f16):    1.12 MiB
llama_new_context_with_model:        CPU input buffer size   =     2.26 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     4.27 MiB, (   28.77 / 98304.00)
llama_new_context_with_model:      Metal compute buffer size =     4.25 MiB
llama_new_context_with_model:        CPU compute buffer size =     0.75 MiB
llama_new_context_with_model: graph splits (measure): 2
time=2024-03-03T12:54:53.051-08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
[GIN] 2024/03/03 - 12:54:53 | 200 |  328.300208ms |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/03/03 - 12:54:53 | 200 |     473.209µs |       127.0.0.1 | POST     "/api/embeddings"
[1]    57562 killed     ollama serve
```


@smarti57 commented on GitHub (Apr 17, 2024):

I'm running into this myself with `mistral:7b-instruct-v0.2-q6_K` on Ollama 0.1.31. I've got a repeating process that does 8 interactions on an input file and then moves on to the next file. I'm actually using the 7B to test on a Linux workstation (128GB RAM) with an NVIDIA GPU (16GB VRAM), but have also been running on an M2 Max 64GB laptop to use a larger model. Looking at nvtop on the workstation, the ollama process is using roughly 6.5GB of VRAM with no creep. Both seem to stall with high GPU/CPU utilization after the 40th or so interaction.

At the same time, I've left Ollama up for weeks on end backing a Big-AGI instance on the same machine and never had any issues. Could there be something going on with either the rapid injection of chats or the model context creeping up? I know there was a ticket open for resetting context state to an initial state; that's why I'm wondering. At this point I'm considering restarting Ollama and taking the model-reload hit every 4 runs or so. That would solve it, but it would be a real waste of time.

Looking over the log, when it hangs it seems to just be spitting this out over and over (thousands of entries):

```
Apr 17 19:29:16 ai ollama[1781]: {"function":"update_slots","level":"INFO","line":1597,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slo>
```

An identical entry is written to the log every 13 seconds. Prior to that, I get the kind of normal, expected log entries per interaction:

```
Apr 17 18:57:47 ai ollama[1781]: [GIN] 2024/04/17 - 18:57:47 | 200 | 5.255910821s | 192.168.1.94 | POST "/api/chat"
Apr 17 18:57:47 ai ollama[1781]: {"function":"launch_slot_with_data","level":"INFO","line":826,"msg":"slot is processing task","slot_id":0,"task_id":6886,"tid":"139863497086528","timestamp":1713380267}
Apr 17 18:57:47 ai ollama[1781]: {"function":"update_slots","ga_i":0,"level":"INFO","line":1805,"msg":"slot progression","n_past":1,"n_past_se":0,"n_prompt_tokens_processed":1822,"slot_id":0,"task_id":6886,"tid":"139863497086528">
Apr 17 18:57:47 ai ollama[1781]: {"function":"update_slots","level":"INFO","line":1832,"msg":"kv cache rm [p0, end)","p0":1,"slot_id":0,"task_id":6886,"tid":"139863497086528","timestamp":1713380267}
Apr 17 18:57:51 ai ollama[1781]: {"function":"update_slots","level":"INFO","line":1597,"msg":"slot context shift","n_cache_tokens":2048,"n_ctx":2048,"n_discard":1023,"n_keep":1,"n_left":2046,"n_past":2047,"n_system_tokens":0,"slo>
Apr 17 18:58:03 ai ollama[1781]: {"function":"print_timings","level":"INFO","line":265,"msg":"prompt eval time = 829.73 ms / 1822 tokens ( 0.46 ms per token, 2195.91 tokens per second)","n_prompt_tokens_processed":18>
Apr 17 18:58:03 ai ollama[1781]: {"function":"print_timings","level":"INFO","line":279,"msg":"generation eval time = 15314.43 ms / 1143 runs ( 13.40 ms per token, 74.64 tokens per second)","n_decoded":1143,"n_tokens_sec>
Apr 17 18:58:03 ai ollama[1781]: {"function":"print_timings","level":"INFO","line":289,"msg":" total time = 16144.16 ms","slot_id":0,"t_prompt_processing":829.725,"t_token_generation":15314.43,"t_total":16144.155,"task>
Apr 17 18:58:03 ai ollama[1781]: {"function":"update_slots","level":"INFO","line":1636,"msg":"slot released","n_cache_tokens":1943,"n_ctx":2048,"n_past":1942,"n_system_tokens":0,"slot_id":0,"task_id":6886,"tid":"139863497086528",>
Apr 17 18:58:03 ai ollama[1781]: [GIN] 2024/04/17 - 18:58:03 | 200 | 16.158508055s | 192.168.1.94 | POST "/api/chat"
```


@smarti57 commented on GitHub (Apr 17, 2024):

Just grabbed 0.1.32... this MAY have been fixed; it's progressing MUCH further than it did previously. It ran longer than before (I got 24 files done before it stalled), but it still stalled out after 17 minutes of continuous use over 24x8 interactions (whatever that works out to).


@smarti57 commented on GitHub (Apr 18, 2024):

Another update: this seems to be related to Mistral. I ran gemma:2b just to test quickly, and it completed all 100 test input files. I'll try a 7B and a 13B model and post updates.


@smarti57 commented on GitHub (Apr 23, 2024):

Final update: I was able to mitigate this in my workflow by swapping out model files. Unloading and reloading a different model after each group of prompts (no more than 8) eliminated the lock-up; see the sketch below. It's a bug for sure, still present in 0.1.32, and certain models (mistral and llama3, I'm looking at you) seem to exhibit it faster. Upping the ctx to 4096 helped a bit but didn't eliminate the issue.
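
A minimal sketch of that rotation, assuming the public `/api/generate` REST endpoint; the model tags, group size, and prompt list are illustrative:

```python
# Sketch of the unload/reload mitigation: alternate between two model tags
# so each group of prompts starts from a freshly loaded model. Loading the
# next model evicts the previous one on a single-GPU or Metal setup.
import itertools

import requests

MODELS = itertools.cycle(["mistral:7b-instruct-v0.2-q6_K", "gemma:2b"])  # illustrative pair

def ask(model: str, prompt: str) -> str:
    resp = requests.post(
        "http://127.0.0.1:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [f"prompt {i}" for i in range(32)]  # placeholder workload
for start in range(0, len(prompts), 8):       # groups of at most 8
    model = next(MODELS)                      # swap model between groups
    for p in prompts[start : start + 8]:
        ask(model, p)
```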


@cloudnativeengineer commented on GitHub (Apr 24, 2024):

@smarti57 thanks for sharing that unload trick! I hadn't heard of it, or even considered it a possibility.


@entmike commented on GitHub (Apr 25, 2024):

> Another update: this seems to be related to Mistral. I ran gemma:2b just to test quickly, and it completed all 100 test input files. I'll try a 7B and a 13B model and post updates.

It's also happening with llava. I'll try your unload trick.


@reski-rukmantiyo commented on GitHub (May 5, 2024):

> I was able to mitigate this in my workflow by swapping out model files. Unloading and reloading a different model after each group of prompts (no more than 8) eliminated the lock-up.

Hi @smarti57, I'm trying to solve this by upping the ctx. Could you share how? This is still happening on 0.1.33.
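
For reference, the context size can be raised per request through the documented `num_ctx` option, or persistently in a Modelfile with `PARAMETER num_ctx 4096`. A minimal sketch, with an illustrative model tag and prompt:

```python
# Sketch: raise the context window per request via the documented num_ctx
# option (the default at the time was 2048). Model tag is illustrative.
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "mistral:7b-instruct-v0.2-q6_K",
        "prompt": "Summarize the report...",
        "stream": False,
        "options": {"num_ctx": 4096},
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```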


@urlan commented on GitHub (Aug 30, 2025):

Well... it's still happening nowadays. I had to wrap `ollama.chat` in `concurrent.futures` to implement a timeout; see the sketch below.
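
A minimal sketch of that wrapper, assuming the official `ollama` Python client; note that `concurrent.futures` cannot actually kill the blocked call, so a hung worker thread is leaked:

```python
# Sketch of a timeout wrapper around ollama.chat using concurrent.futures.
# On timeout the worker thread keeps running (Python cannot kill a blocked
# thread); the pool is shut down without waiting so the caller can move on.
from concurrent.futures import ThreadPoolExecutor

import ollama

def chat_with_timeout(model: str, messages: list, timeout: float = 120.0):
    """Run ollama.chat in a worker thread; raise TimeoutError if it stalls."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(ollama.chat, model=model, messages=messages)
    try:
        return future.result(timeout=timeout)  # raises TimeoutError on stall
    finally:
        pool.shutdown(wait=False)  # don't block on a hung worker thread

reply = chat_with_timeout("llama3", [{"role": "user", "content": "Hello"}])
print(reply["message"]["content"])
```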


@omerts commented on GitHub (Apr 21, 2026):

We are also experiencing hangs/stalls using qwen 3.5 / qwen 3.6 / qwen3-coder on a Mac Mini M4 Pro with 48GB RAM. We run scans on large texts, and it sometimes just stalls in the middle and never recovers.
We basically have a large text, and we start separate conversations, asking a single question about that text in each conversation. The differences between conversations are minor, yet it can still get stuck after 6 out of 9 questions. In other words, it is something the models/Ollama were already able to answer, most of the prompt is already in the KV cache, and it still just stalls.
