[GH-ISSUE #12193] Regression in v0.11.10: Crash with LoRA on Apple Silicon #8109

Open
opened 2026-04-12 20:26:28 -05:00 by GiteaMirror · 2 comments

Originally created by @mingqxu on GitHub (Sep 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12193

What is the issue?

Based on the logs, the key difference between versions 0.8.0 and 0.11.10 is a fundamental architectural change in how Ollama launches its model runner and manages memory.

The newer version (0.11.10) appears to have a regression in its memory allocation logic that fails for this specific phi3:mini model and LoRA combination, whereas the older version's approach was more robust and handled it correctly.

----------------------------- here is the Modelfile that applies the adapter -------------------
FROM phi3:mini
ADAPTER ./adapter.gguf

SYSTEM """
You are a reliability/FMEA assistant. Be concise and structured.
"""

TEMPLATE "{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"
PARAMETER stop <|end|>
PARAMETER stop <|user|>
PARAMETER stop <|assistant|>

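A minimal sketch of how this model can be built and exercised from the Modelfile above (assuming adapter.gguf sits next to it; the model name phi3-fmea is illustrative):

    # Build the custom model from the Modelfile shown above
    ollama create phi3-fmea -f ./Modelfile

    # Load the model and send a first prompt; on 0.11.10 this is where the crash occurs
    ollama run phi3-fmea "List the top failure modes for a cooling fan."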

1. Runner Architecture and Configuration

The most visible difference is how the main ollama serve process starts the backend runner.

  • Ollama 0.8.0 (Successful): This version builds a long, detailed command to launch the runner, explicitly passing arguments like context size and the LoRA path.
    time=... source=server.go:431 msg="starting llama server" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model ... --ctx-size 8192 ... --lora /Users/mingqxu/.ollama/models/blobs/sha256-776ceb..."
    
  • Ollama 0.11.10 (Crashing): This version starts a much simpler, more generic runner process. The configuration (like which LoRA to use) is handled internally after the process starts, not through command-line arguments.
    time=... source=server.go:398 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/mingqxu/.ollama/models/blobs/sha256-633... --port 52928"
    

What this means: The logic for configuring the model session was completely refactored. The older method was more direct, while the newer method is more abstract. This new abstraction seems to be where the problem lies.

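To check which launch path a given install uses, the runner command line is printed in the server log. A quick way to see it on the macOS app (the log path is the usual default and may differ on other setups):

    grep "starting" ~/.ollama/logs/server.log
    # 0.8.0 prints "starting llama server" with --ctx-size and --lora flags;
    # 0.11.10 prints "starting runner" with only --model and --port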

0.8.0.log: https://github.com/user-attachments/files/22173249/0.8.0.log
0.11.10.log: https://github.com/user-attachments/files/22173250/0.11.10.log

2. Memory Management and Allocation

This is the most critical difference and the direct cause of the crash. It's quite counter-intuitive.

  • Ollama 0.8.0 (Successful):

    • Estimated a 6.1 GiB total memory requirement.
    • Successfully allocated a KV cache for an 8192-token context (llama_context: n_ctx = 8192).
    • It correctly calculated the larger memory footprint needed and successfully allocated it.
  • Ollama 0.11.10 (Crashing):

    • Estimated a smaller 4.3 GiB memory requirement.
    • Attempted to use a smaller 4096-token context (llama_context: n_ctx = 4096).
    • Despite needing less memory overall, it failed to allocate a tiny piece of it inside a specific memory pool (ggml_new_object: not enough space...).

What this means: The issue in the new version isn't a lack of total RAM on your machine. The problem is that its internal memory pool calculation is flawed. When loading your model plus the LoRA adapter, it miscalculates the size needed for its context memory pool, making it slightly too small. When the first inference request comes in, it tries to allocate an object that won't fit, and the whole thing crashes.

The older v0.8.0 was better at correctly estimating and allocating the necessary resources for this specific combination.
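One diagnostic experiment (not a confirmed fix) is to pin the context length to the 8192 tokens that 0.8.0 used and see whether the allocation failure still reproduces. num_ctx is a standard Ollama parameter; the model name below is illustrative:

    # Option A: add to the Modelfile before running `ollama create`
    #   PARAMETER num_ctx 8192
    # Option B: pass it per request through the API options
    curl http://localhost:11434/api/generate -d '{
      "model": "phi3-fmea",
      "prompt": "test after raising num_ctx",
      "options": { "num_ctx": 8192 }
    }'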

Relevant log output

See the two attached log files above for details.

OS

macOS

GPU

No response

CPU

Apple

Ollama version

0.11.10

GiteaMirror added the bug label 2026-04-12 20:26:28 -05:00

@jessegross commented on GitHub (Sep 5, 2025):

This is not a memory allocation issue, it looks like a bug in the old runner (llama.cpp). You could try it without the lora or fusing it to see if that helps.

Please stick to the facts with bug reports, the AI analysis is not helpful.
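For reference, fusing the adapter offline can be done with llama.cpp's export-lora tool before creating the Ollama model. A rough sketch, with illustrative file names and flags that may vary between llama.cpp versions:

    # Merge the LoRA weights into the base model weights
    ./llama-export-lora -m phi3-mini-base.gguf --lora adapter.gguf -o phi3-mini-fused.gguf

    # Build an Ollama model from the fused GGUF; no ADAPTER line is needed
    #   Modelfile.fused:  FROM ./phi3-mini-fused.gguf
    ollama create phi3-fmea-fused -f Modelfile.fused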


@mingqxu commented on GitHub (Sep 6, 2025):

Yes, you’re right — fusing the model and running the fused version under Ollama 0.11.10 works fine. At the same time, I’d still like to run the adapter, since it provides the flexibility to swap adapters as needed. I really look forward to the upcoming fix.
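For context, the adapter-based workflow the reporter wants to keep would look roughly like this, with one Modelfile per adapter (names are illustrative):

    # Each Modelfile uses the same base model but a different ADAPTER line
    ollama create phi3-fmea-v1 -f Modelfile.v1    # ADAPTER ./adapter_v1.gguf
    ollama create phi3-fmea-v2 -f Modelfile.v2    # ADAPTER ./adapter_v2.gguf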

Reference: github-starred/ollama#8109