[GH-ISSUE #14629] Ollama very slow with LLM vision on Mac mini M4 16GB #35238

Open
opened 2026-04-22 19:37:27 -05:00 by GiteaMirror · 0 comments

Originally created by @vpoma777 on GitHub (Mar 4, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14629

What is the issue?

Environment

  • Hardware: Mac mini M4 16GB unified memory
  • OS: macOS
  • Ollama version: 0.17.6
  • Model: llama3.2-vision:11b (running 100% GPU via Metal)

Problem

Every inference call that includes an image takes ~20 seconds, regardless of which optimizations were attempted. The visual-encoder latency is not reported in the API response breakdown, so it is invisible in the metrics even though it dominates the total time.
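
For reference, a minimal sketch of the kind of call being timed (the prompt text and image path are placeholders; llama3.2-vision:11b is the model from the environment above, and the image is base64-encoded as the /api/generate endpoint expects):

```shell
# Hypothetical single vision request against a local Ollama server.
# macOS `base64 -i` emits one unwrapped line, which is what the API expects.
curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"llama3.2-vision:11b\",
  \"prompt\": \"Describe this image.\",
  \"stream\": false,
  \"images\": [\"$(base64 -i photo.jpg)\"]
}" > resp.json
```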

What I tested

  • Image sizes: 24KB to 267KB → no difference
  • Image resolutions: 560x560 to 1920x1080 → no difference
  • OLLAMA_FLASH_ATTENTION=1 → no difference
  • OLLAMA_KV_CACHE_TYPE=q8_0 → no difference (both variables set as in the sketch after this list)
  • Warm cache vs cold cache → no difference
  • Killed duplicate Ollama instance (app + CLI conflict) → no difference
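
To be explicit about how the two environment variables above were applied, here is a sketch of the server launch; it assumes the plain CLI workflow, where the variables must be exported in the environment of ollama serve itself (setting them only for the client has no effect):

```shell
# Assumed launch sequence for the flash-attention / KV-cache tests above.
# (With the menu-bar app instead of `ollama serve`, the variables would have
# to reach the app's environment, e.g. via `launchctl setenv`.)
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
```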

API response breakdown

total_duration: ~20s
load_duration: ~0.1s
prompt_eval_duration: ~0.75s
eval_duration: ~0.46s
─────────────────────────
Accounted for: ~1.3s
Unaccounted: ~19s ← visual encoder overhead, not reported
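
These figures come straight from the /api/generate JSON response, where every *_duration field is reported in nanoseconds, so the unreported gap can be computed directly. A sketch, assuming the response from the earlier request was saved to resp.json:

```shell
# Convert the nanosecond duration fields to seconds and derive the gap
# that the API does not attribute to any phase.
jq '{
  total_s:       (.total_duration       / 1e9),
  load_s:        (.load_duration        / 1e9),
  prompt_eval_s: (.prompt_eval_duration / 1e9),
  eval_s:        (.eval_duration        / 1e9),
  unaccounted_s: ((.total_duration - .load_duration
                   - .prompt_eval_duration - .eval_duration) / 1e9)
}' resp.json
```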

Expected behavior

On an Apple M4 with Metal acceleration and the model fully loaded on the GPU,
visual encoding should take roughly 2-4 seconds, not 20.

Notes

When the same image is sent twice in rapid succession (within seconds),
the second call returns in ~0.7s thanks to a visual-cache hit. This confirms
the M4 is capable of fast inference; the problem is that the first encoding
of any new image takes ~20s, and there is no way to pre-warm the cache with a
different image.
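
A sketch of the back-to-back check described above; it assumes the same request body as the earlier curl example, saved to request.json, so the only difference between the two calls is whether the visual cache is already warm:

```shell
# First call with a new image: pays the full visual-encoder cost (~20 s observed).
time curl -s http://localhost:11434/api/generate -d @request.json > /dev/null
# Immediate repeat with the identical image: visual cache hit (~0.7 s observed).
time curl -s http://localhost:11434/api/generate -d @request.json > /dev/null
```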

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 19:37:27 -05:00

Reference: github-starred/ollama#35238