[GH-ISSUE #12443] Different load times for an already loaded model, Why? #8266

Open
opened 2026-04-12 20:48:26 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @yisuscc on GitHub (Sep 29, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12443

I'm currently building a benchmarking tool with Ollama and noticed that it reports a different load time for each prompt (see the `load_duration` column in the picture below), even though the model should stay loaded in memory (`keep_alive` is set to -1). The same thing happens with all models tested (gemma3n, mistral, qwen2.5:0.5b, ...).

![Ollama benchmark results showing a different load_duration for each prompt](https://github.com/user-attachments/assets/329b63ed-0b33-4ad6-8440-537db10f24ec)
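Even a minimal loop like the following shows a non-zero `load_duration` on every call. This is a rough sketch using the official `ollama` Python package; the model name and prompt are just examples:

```python
import ollama

client = ollama.Client()

# Send the same prompt several times; with keep_alive=-1 the model should
# stay resident, so load_duration ought to be near zero after the first call.
for i in range(5):
    response = client.generate(
        model="qwen2.5:0.5b",   # one of the models tested
        prompt="Say hello.",
        keep_alive=-1,          # never unload the model
    )
    print(f"run {i}: load_duration = {response.load_duration} ns")
```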

For comparison, here are the llama.cpp results for a different model; notice how the load duration is the same across all prompts.

![llama.cpp benchmark results showing an identical load duration across all prompts](https://github.com/user-attachments/assets/4831a2fd-1924-4d52-9744-b6940617d8ed)

The inference code for both versions:

- Ollama:

```python
def query_event(self, prompt: str, model: str, event: Event,
                prompt_id: int = -1, keep_alive='-1') -> PromptMetrics:
    start = time.time_ns()
    event.set()
    response: GenerateResponse = self.client.generate(
        prompt=prompt, model=model, keep_alive=keep_alive)
    finish = time.time_ns()
    event.clear()
    return PromptMetrics.ollama_pseudoconstructor(start, finish, response, prompt_id)
```

- llama.cpp:

```python
def query_event(self, prompt: str, event: Event, prompt_id=-1) -> PromptMetrics:
    # reset the performance context
    llama_perf_context_reset(self.ctx)
    event.set()
    # so the time is a bit more accurate
    start = time.time_ns()
    # We query the machine
    response = self(prompt)
    finish = time.time_ns()
    event.clear()
    return PromptMetrics.llama_cpp_pseudoconstructor(
        start, finish, self.get_performance_metrics(), self.get_name(), prompt_id)
```
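One way to rule out the model actually being evicted between prompts is to check which models are resident right before each call. Below is a minimal sketch, assuming the `ollama` Python client exposes the /api/ps endpoint as `ps()`; the model name is again just an example:

```python
import ollama

client = ollama.Client()

# Warm the model up once; with keep_alive=-1 it should never be evicted.
client.generate(model="qwen2.5:0.5b", prompt="warm-up", keep_alive=-1)

# /api/ps reports the models currently held in memory. If the model is still
# listed right before each benchmark prompt, the non-zero load_duration values
# are not caused by an actual reload from disk.
print(client.ps())
```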
Reference: github-starred/ollama#8266