[GH-ISSUE #15258] Ollama 0.20.0: /v1/chat/completions hangs indefinitely on Apple Silicon M4 #71817

Closed
opened 2026-05-05 02:37:25 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @vikramwalia on GitHub (Apr 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15258

What is the issue?

Ollama 0.20.0: /v1/chat/completions hangs indefinitely on Apple Silicon M4

Environment

  • Hardware: Mac Mini M4 (32GB unified memory)
  • OS: macOS (Apple Silicon arm64)
  • Ollama: 0.20.0 GA (both Homebrew bottle and official ollama-darwin.tgz from GitHub releases)
  • Previous working version: 0.20.0-rc1 (installed via curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.20.0-rc1 sh)

Bug Summary
The OpenAI-compatible /v1/chat/completions endpoint hangs indefinitely for all generative models on Ollama 0.20.0 GA running on Apple Silicon M4. The request is accepted (TCP connection established, POST sent) but zero bytes are ever returned. After curl's timeout, the server logs a 499 (client closed connection).

Additionally, we discovered that /api/chat and /api/generate (native endpoints) are ALSO broken on 0.20.0 GA — they exhibit the same hang behavior. The runner process spawns, loads the model successfully, but produces no output.

What works

  • /api/version — responds instantly
  • /api/tags — lists models correctly
  • /api/ps — shows loaded models
  • /api/pull — pulls models successfully
  • /api/embeddings — nomic-embed-text returns 768-dim vectors in ~30ms
  • /v1/embeddings — also works perfectly

What's broken

  • /v1/chat/completions — hangs, 0 bytes, eventually 499
  • /api/chat — hangs, 0 bytes
  • /api/generate — hangs, 0 bytes (both stream:true and stream:false)
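
The split above is easy to verify in one pass. Below is a minimal probe script — a sketch, not from the original report — assuming the same host and that gemma4:e2b and nomic-embed-text are already pulled; with curl's -m timeout, exit code 28 marks the hanging endpoints.

```
#!/bin/sh
# Probe each endpoint with a hard timeout. Assumes Ollama on
# localhost:11434 with gemma4:e2b and nomic-embed-text pulled.
HOST=http://localhost:11434

for ep in /api/version /api/tags /api/ps; do
  printf '%-22s ' "$ep"
  curl -sf -m 15 "$HOST$ep" > /dev/null && echo OK || echo "FAIL (exit $?)"
done

printf '%-22s ' /api/embeddings
curl -sf -m 15 "$HOST/api/embeddings" \
  -d '{"model":"nomic-embed-text","prompt":"test"}' > /dev/null \
  && echo OK || echo "FAIL (exit $?)"

printf '%-22s ' /api/generate
curl -sf -m 15 "$HOST/api/generate" \
  -d '{"model":"gemma4:e2b","prompt":"hi","stream":false}' > /dev/null \
  && echo OK || echo "FAIL (exit $?)"   # exit 28 = curl timeout (the hang)
```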

Models tested (all fail)

  • gemma4:e2b (7.2GB)
  • gemma4:26b (18GB)
  • qwen3-vl:8b (6.1GB)
  • qwen3.5:9b (6.6GB)

Reproduction

```
# Server running with:
# OLLAMA_HOST=0.0.0.0:11434
# OLLAMA_FLASH_ATTENTION=1

# This hangs forever (or until timeout):
curl -m 60 http://localhost:11434/api/generate \
  -d '{"model":"gemma4:e2b","prompt":"Say hello","stream":false}'
# Returns: empty (0 bytes after 60s)

# This also hangs:
curl -m 60 http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4:e2b","messages":[{"role":"user","content":"Say hello"}],"max_tokens":10}'
# Returns: empty (0 bytes after 60s)

# But this works instantly:
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"test"}'
# Returns: 768-dim embedding vector
```
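
To make the failure mode unambiguous, curl's standard write-out variables can report the HTTP status alongside the exit code (a sketch: on a hang, the status prints as 000 and curl exits 28 on timeout):

```
curl -m 60 -s -o /dev/null -w 'http=%{http_code}\n' \
  http://localhost:11434/api/generate \
  -d '{"model":"gemma4:e2b","prompt":"Say hello","stream":false}'
echo "curl exit=$?"   # 28 = timed out with zero bytes received
```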

Server logs during the hang

```
# Error log shows model loads successfully:
level=INFO source=server.go:1390 msg="llama runner started in 1.62 seconds"

# But the runner process consumes 200-380% CPU indefinitely:
vsw  64149 384.6 28.8 449640000 9672752 ?? R /opt/homebrew/Cellar/ollama/0.20.0/bin/ollama runner ...

# Serve log shows the request eventually times out:
[GIN] 2026/04/02 - 21:46:51 | 499 | 3m16s | ::1 | POST "/v1/chat/completions"
```
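
Not part of the original report, but useful for triage: macOS ships a `sample` tool that can capture call stacks of the busy-spinning runner, and `OLLAMA_DEBUG=1` makes the server logs more verbose. A sketch, substituting the runner PID from the ps line above:

```
# Profile the spinning runner for ~10 seconds (64149 is the PID from
# the ps output above; substitute your own).
sample 64149 10 -file runner-hang.txt

# Restart the server with verbose logging to see where generation stalls:
OLLAMA_DEBUG=1 ollama serve
```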

Troubleshooting performed

  1. Stripped all env vars (removed OLLAMA_FLASH_ATTENTION, OLLAMA_KV_CACHE_TYPE, OLLAMA_NUM_PARALLEL) — same behavior
  2. Tested with bare OLLAMA_HOST=0.0.0.0:11434 only (see the clean-environment sketch after this list) — same behavior
  3. Tested locally (127.0.0.1) and over LAN (10.0.3.161) — same behavior
  4. Tested both Homebrew bottle and official darwin tgz — same behavior
  5. Re-pulled models (/api/pull) — models pull successfully, still can't generate
  6. Tested streaming mode (stream:true) — also hangs, no output
  7. Confirmed non-multimodal models (qwen3.5:9b) also hang — not Gemma-specific
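
For steps 1–2, one way to guarantee a truly clean environment — an illustrative sketch, not the exact commands from the report — is to scrub the environment entirely before launching the server:

```
# Stop any running instance, then start the server with only the
# variables it strictly needs — no stray OLLAMA_* settings can leak in.
pkill ollama 2>/dev/null
env -i HOME="$HOME" PATH="$PATH" \
    OLLAMA_HOST=0.0.0.0:11434 \
    ollama serve
```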

Working configuration
Reverting to Ollama 0.19.0 (ollama-darwin.tgz from the v0.19.0 GitHub release) restores everything: all native endpoints work, /v1/chat/completions works, and tool calling works. However, 0.19.0 does not support Gemma 4 models (500 error on load).
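
For anyone needing the same downgrade, the revert looks roughly like this (a sketch; the asset URL follows the naming cited in this report — verify it against the v0.19.0 release page before running):

```
# Download and unpack the v0.19.0 darwin build (asset name as cited above).
curl -fL -O https://github.com/ollama/ollama/releases/download/v0.19.0/ollama-darwin.tgz
tar -xzf ollama-darwin.tgz   # assumes the ollama binary sits at the archive root

# Install location is an assumption — anywhere on PATH works.
sudo mv ollama /usr/local/bin/ollama
ollama --version   # should report 0.19.0
```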

Key observation
0.20.0-rc1 worked perfectly on the same hardware with the same models and the same configuration. The rc1 was installed via the direct install script with OLLAMA_VERSION=0.20.0-rc1 and ran Gemma 4 models with native /api/chat, /api/generate, tool calling, and even /v1/chat/completions (though /v1 was slower). In other words, the regression was introduced between rc1 and the final 0.20.0 GA build.

Impact
This blocks usage of Gemma 4 models on Apple Silicon M4, since:

  • 0.19.0 doesn't support Gemma 4
  • 0.20.0 can't generate any output

Comparison: M1 Mac Mini works fine
On a Mac Mini M1 (16GB) running Ollama 0.18.2, all endpoints including /v1/chat/completions work correctly with llama3.1:8b. This issue appears specific to the 0.20.0 GA build on Apple Silicon (at minimum M4, untested on M1 with 0.20.0).

OS

macOS

GPU

Apple M4

CPU

Apple M4

Ollama version

0.20.0

GiteaMirror added the bug label 2026-05-05 02:37:25 -05:00
Author
Owner

@vikramwalia commented on GitHub (Apr 4, 2026):

This is resolved with the 0.20.2 update. Thank you!


Reference: github-starred/ollama#71817