[GH-ISSUE #15453] Ollama Cloud Pro: 95% failure rate across all cloud models — service is unusable #9878

Open
opened 2026-04-12 22:44:27 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @KUANKEI721 on GitHub (Apr 9, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15453

Originally assigned to: @jmorganca on GitHub.

Environment

  • Plan: Ollama Pro ($20/month, subscribed 2026-04-09)
  • OS: macOS (Darwin 25.4.0)
  • Ollama version: Latest (via brew)
  • Connection: Stable internet, 0% packet loss to ollama.com
  • Models tested: glm-5.1:cloud, kimi-k2.5:cloud, qwen3.5:cloud, deepseek-v3.2:cloud

Problem

Ollama Cloud is effectively unusable. Both /api/chat and /api/generate endpoints return empty responses or timeout for all cloud models. This is not model-specific — every single cloud model exhibits the same behavior.

Reproduction

Simple test — 5 sequential requests per model, 20-second timeout:

for model in glm-5.1:cloud kimi-k2.5:cloud qwen3.5:cloud deepseek-v3.2:cloud; do
  echo "=== $model ==="
  for i in 1 2 3 4 5; do
    curl -s --max-time 20 http://localhost:11434/api/chat \
      -d "{\"model\":\"$model\",\"messages\":[{\"role\":\"user\",\"content\":\"hi\"}],\"stream\":false}" \
      | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'#{'"$i"'} OK {d[\"total_duration\"]/1e9:.1f}s')" 2>/dev/null \
      || echo "#$i FAIL (empty/timeout)"
  done
done

Results (2026-04-09, ~21:00 UTC+8)

Model Success Rate Notes
glm-5.1:cloud 0/5 All empty/timeout
kimi-k2.5:cloud 1/5 1 success (2.6s), 4 failures
qwen3.5:cloud 0/5 All empty/timeout
deepseek-v3.2:cloud 0/5 All empty/timeout
Total 1/20 (5%)

Earlier in the day, glm-5.1:cloud worked intermittently (2/3 success), so this appears to be a degrading situation.

Both endpoints affected

Tested /api/generate as well — same 0/5 failure rate for glm-5.1:cloud. This rules out a /api/chat-specific bug.

Expected behavior

As a paying Pro subscriber ($20/month), I expect a reasonable success rate (>95%) for cloud model inference. A 5% success rate is not a degraded service — it is a broken service.

What I've ruled out

  • Local Ollama service is running (localhost:11434 responds, ollama list shows all cloud models)
  • Network is stable (non-cloud local models work fine)
  • Not a single-model issue (all 4 cloud models fail)
  • Not an endpoint issue (/api/chat and /api/generate both fail)
  • Tested with minimal payloads ("hi") — not a token limit issue

This aligns with multiple existing reports:

  • #15419 — Frequent 503 errors on cloud models (2026-04-08, 7+ confirmations)
  • #14673 — 29.7% failure rate documented, support tickets ignored 2+ weeks
  • #15290 — EOF errors and socket closures on cloud models

Requests

  1. Acknowledge the outage — There is no status page, no incident communication, and no response on existing issues
  2. Provide a status page for Ollama Cloud service health
  3. Add Retry-After headers on 503/502 responses so clients can implement proper backoff
  4. Consider pro-rating or extending subscriptions for periods of sustained outage — charging $20/month for a 5% success rate is not acceptable
Originally created by @KUANKEI721 on GitHub (Apr 9, 2026). Original GitHub issue: https://github.com/ollama/ollama/issues/15453 Originally assigned to: @jmorganca on GitHub. ## Environment - **Plan**: Ollama Pro ($20/month, subscribed 2026-04-09) - **OS**: macOS (Darwin 25.4.0) - **Ollama version**: Latest (via `brew`) - **Connection**: Stable internet, 0% packet loss to ollama.com - **Models tested**: `glm-5.1:cloud`, `kimi-k2.5:cloud`, `qwen3.5:cloud`, `deepseek-v3.2:cloud` ## Problem Ollama Cloud is effectively unusable. Both `/api/chat` and `/api/generate` endpoints return **empty responses or timeout** for all cloud models. This is not model-specific — every single cloud model exhibits the same behavior. ## Reproduction Simple test — 5 sequential requests per model, 20-second timeout: ```bash for model in glm-5.1:cloud kimi-k2.5:cloud qwen3.5:cloud deepseek-v3.2:cloud; do echo "=== $model ===" for i in 1 2 3 4 5; do curl -s --max-time 20 http://localhost:11434/api/chat \ -d "{\"model\":\"$model\",\"messages\":[{\"role\":\"user\",\"content\":\"hi\"}],\"stream\":false}" \ | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'#{'"$i"'} OK {d[\"total_duration\"]/1e9:.1f}s')" 2>/dev/null \ || echo "#$i FAIL (empty/timeout)" done done ``` ### Results (2026-04-09, ~21:00 UTC+8) | Model | Success Rate | Notes | |-------|-------------|-------| | `glm-5.1:cloud` | **0/5** | All empty/timeout | | `kimi-k2.5:cloud` | **1/5** | 1 success (2.6s), 4 failures | | `qwen3.5:cloud` | **0/5** | All empty/timeout | | `deepseek-v3.2:cloud` | **0/5** | All empty/timeout | | **Total** | **1/20 (5%)** | | Earlier in the day, `glm-5.1:cloud` worked intermittently (2/3 success), so this appears to be a degrading situation. ### Both endpoints affected Tested `/api/generate` as well — same 0/5 failure rate for `glm-5.1:cloud`. This rules out a `/api/chat`-specific bug. ## Expected behavior As a paying Pro subscriber ($20/month), I expect a reasonable success rate (>95%) for cloud model inference. A **5% success rate** is not a degraded service — it is a broken service. ## What I've ruled out - ✅ Local Ollama service is running (`localhost:11434` responds, `ollama list` shows all cloud models) - ✅ Network is stable (non-cloud local models work fine) - ✅ Not a single-model issue (all 4 cloud models fail) - ✅ Not an endpoint issue (`/api/chat` and `/api/generate` both fail) - ✅ Tested with minimal payloads (`"hi"`) — not a token limit issue ## Related issues This aligns with multiple existing reports: - #15419 — Frequent 503 errors on cloud models (2026-04-08, 7+ confirmations) - #14673 — 29.7% failure rate documented, support tickets ignored 2+ weeks - #15290 — EOF errors and socket closures on cloud models ## Requests 1. **Acknowledge the outage** — There is no status page, no incident communication, and no response on existing issues 2. **Provide a status page** for Ollama Cloud service health 3. **Add `Retry-After` headers** on 503/502 responses so clients can implement proper backoff 4. **Consider pro-rating or extending subscriptions** for periods of sustained outage — charging $20/month for a 5% success rate is not acceptable
GiteaMirror added the cloud label 2026-04-12 22:44:27 -05:00
Author
Owner

@bartlomiejwolk commented on GitHub (Apr 9, 2026):

Same for me. I'm a new Ollama Pro user and I started thinking that this is the normal. I'm glad it's not.

<!-- gh-comment-id:4215448317 --> @bartlomiejwolk commented on GitHub (Apr 9, 2026): Same for me. I'm a new Ollama Pro user and I started thinking that this is the normal. I'm glad it's not.
Author
Owner

@dongluochen commented on GitHub (Apr 10, 2026):

@KUANKEI721 thanks for reporting the issues. Sorry about the experiences. Can you re-run your requests and provide the time and request ids (like below) for us to investigate.

Internal Server Error (ref: 4f8b6a2c-a0ec-474e-b37a-6542b4ea732e)
<!-- gh-comment-id:4218782831 --> @dongluochen commented on GitHub (Apr 10, 2026): @KUANKEI721 thanks for reporting the issues. Sorry about the experiences. Can you re-run your requests and provide the time and request ids (like below) for us to investigate. ``` Internal Server Error (ref: 4f8b6a2c-a0ec-474e-b37a-6542b4ea732e) ```
Author
Owner

@jmorganca commented on GitHub (Apr 10, 2026):

Hi all I'm sorry for the issues with Ollama's cloud this morning. We've been working hard to increase capacity. It should be improving now and we'll continue to monitor it.

<!-- gh-comment-id:4218854979 --> @jmorganca commented on GitHub (Apr 10, 2026): Hi all I'm sorry for the issues with Ollama's cloud this morning. We've been working hard to increase capacity. It should be improving now and we'll continue to monitor it.
Author
Owner

@KUANKEI721 commented on GitHub (Apr 10, 2026):

@dongluochen Thanks for the quick response and for looking into this.

We don't have a ref because these failures are not 500 Internal Server Error — they are 502 Bad Gateway with an empty response body, so no app-level error UUID is returned.

Per your own API error documentation, 502 is defined as:

502: Bad Gateway (e.g. when a cloud model cannot be reached)

This is a server-side cloud routing issue, not a client-side problem. The requests never reached the application layer that generates ref UUIDs — they failed at the gateway.


Representative failed samples

From our local ~/.ollama/logs/server.log (all times UTC+8, all requests were sequential — one at a time, no concurrency):

# Time (UTC+8) Endpoint Status Latency
1 2026-04-09 17:29:14 /v1/chat/completions 502 5m0s
2 2026-04-09 21:35:21 /api/chat 502 20.0s
3 2026-04-09 22:09:35 /api/chat 502 20.0s
4 2026-04-09 22:10:44 /api/generate 502 2m0s
5 2026-04-10 02:11:42 /v1/chat/completions 502 5.0s
Full list: 48 total 502s (39 on 04-09, 9 on 04-10)
[GIN] 2026/04/09 - 17:29:14 | 502 |          5m0s | POST "/v1/chat/completions"
[GIN] 2026/04/09 - 17:34:16 | 502 |          5m0s | POST "/v1/chat/completions"
[GIN] 2026/04/09 - 21:10:48 | 502 |  30.00136325s | POST "/api/chat"
[GIN] 2026/04/09 - 21:35:21 | 502 | 20.004127292s | POST "/api/chat"
[GIN] 2026/04/09 - 21:35:41 | 502 | 20.002606875s | POST "/api/chat"
[GIN] 2026/04/09 - 21:35:48 | 502 |  7.249221167s | POST "/api/chat"
[GIN] 2026/04/09 - 21:45:08 | 502 |  20.00288025s | POST "/api/chat"
[GIN] 2026/04/09 - 21:46:14 | 502 | 20.002244458s | POST "/api/chat"
[GIN] 2026/04/09 - 21:47:23 | 502 | 20.001394208s | POST "/api/chat"
[GIN] 2026/04/09 - 21:48:05 | 502 | 20.002511625s | POST "/api/chat"
[GIN] 2026/04/09 - 22:08:35 | 502 | 20.002653625s | POST "/api/generate"
[GIN] 2026/04/09 - 22:08:55 | 502 | 20.002629833s | POST "/api/generate"
[GIN] 2026/04/09 - 22:09:15 | 502 |    20.002901s | POST "/api/generate"
[GIN] 2026/04/09 - 22:09:35 | 502 |  20.00160475s | POST "/api/chat"
[GIN] 2026/04/09 - 22:09:55 | 502 | 20.002261209s | POST "/api/chat"
[GIN] 2026/04/09 - 22:10:15 | 502 | 20.002600167s | POST "/api/chat"
[GIN] 2026/04/09 - 22:10:35 | 502 | 20.001014458s | POST "/api/chat"
[GIN] 2026/04/09 - 22:10:44 | 502 |          2m0s | POST "/api/generate" (×9)
[GIN] 2026/04/09 - 22:10:55 | 502 |   20.0013255s | POST "/api/chat"
[GIN] 2026/04/09 - 22:12:44 | 502 |          2m0s | POST "/api/generate" (×9)
[GIN] 2026/04/09 - 22:15:59 | 502 |          3m0s | POST "/api/generate"
[GIN] 2026/04/09 - 22:17:05 | 502 |          1m0s | POST "/api/generate"
[GIN] 2026/04/09 - 22:17:35 | 502 | 15.002056084s | POST "/api/generate"
[GIN] 2026/04/10 - 02:11:42 | 502 |  4.988163084s | POST "/v1/chat/completions"
[GIN] 2026/04/10 - 13:13:40 | 502 | 20.003288167s | POST "/api/chat"
[GIN] 2026/04/10 - 13:14:10 | 502 | 20.002086333s | POST "/api/chat"
[GIN] 2026/04/10 - 13:14:37 | 502 | 20.002178709s | POST "/api/chat"
[GIN] 2026/04/10 - 13:17:58 | 502 | 20.001516875s | POST "/api/chat"
[GIN] 2026/04/10 - 13:19:36 | 502 | 20.002368333s | POST "/api/chat"
[GIN] 2026/04/10 - 13:22:20 | 502 | 20.003383167s | POST "/api/chat"
[GIN] 2026/04/10 - 13:24:03 | 502 | 20.003107333s | POST "/api/chat"
[GIN] 2026/04/10 - 13:24:23 | 502 | 20.003951959s | POST "/api/chat"

What the 502 responses look like vs. successful ones

Successful requests return rich tracing headers:

Server: Google Frontend
X-Request-Id: 39a39f87-d82e-45dd-8408-0d95692b876e
Traceparent: 00-86d89d6bfb1481d41f5799ef8470d8b6-a91d7a5e4b265d16-00
X-Cloud-Trace-Context: 86d89d6bfb1481d41f5799ef8470d8b6/12186030712140750102

The 502 failures return empty body, no headers — there is nothing client-side to provide beyond the timestamps above.


As of 2026-04-10 ~05:30 UTC, the issue is no longer reproducing after the capacity improvements @jmorganca mentioned. This is consistent with a transient capacity/routing incident on the cloud backend.

Could you investigate the 502s using these UTC+8 timestamps against your edge/gateway logs? Happy to provide our account email privately if that helps correlate.

<!-- gh-comment-id:4221506658 --> @KUANKEI721 commented on GitHub (Apr 10, 2026): @dongluochen Thanks for the quick response and for looking into this. We don't have a `ref` because these failures are **not** `500 Internal Server Error` — they are `502 Bad Gateway` with an **empty response body**, so no app-level error UUID is returned. Per your own [API error documentation](https://github.com/ollama/ollama/blob/main/docs/api/errors.mdx), `502` is defined as: > **502**: Bad Gateway (e.g. when a cloud model cannot be reached) This is a server-side cloud routing issue, not a client-side problem. The requests never reached the application layer that generates `ref` UUIDs — they failed at the gateway. --- ### Representative failed samples From our local `~/.ollama/logs/server.log` (all times UTC+8, **all requests were sequential — one at a time, no concurrency**): | # | Time (UTC+8) | Endpoint | Status | Latency | |---|-------------|----------|--------|---------| | 1 | 2026-04-09 17:29:14 | `/v1/chat/completions` | 502 | 5m0s | | 2 | 2026-04-09 21:35:21 | `/api/chat` | 502 | 20.0s | | 3 | 2026-04-09 22:09:35 | `/api/chat` | 502 | 20.0s | | 4 | 2026-04-09 22:10:44 | `/api/generate` | 502 | 2m0s | | 5 | 2026-04-10 02:11:42 | `/v1/chat/completions` | 502 | 5.0s | <details> <summary>Full list: 48 total 502s (39 on 04-09, 9 on 04-10)</summary> ``` [GIN] 2026/04/09 - 17:29:14 | 502 | 5m0s | POST "/v1/chat/completions" [GIN] 2026/04/09 - 17:34:16 | 502 | 5m0s | POST "/v1/chat/completions" [GIN] 2026/04/09 - 21:10:48 | 502 | 30.00136325s | POST "/api/chat" [GIN] 2026/04/09 - 21:35:21 | 502 | 20.004127292s | POST "/api/chat" [GIN] 2026/04/09 - 21:35:41 | 502 | 20.002606875s | POST "/api/chat" [GIN] 2026/04/09 - 21:35:48 | 502 | 7.249221167s | POST "/api/chat" [GIN] 2026/04/09 - 21:45:08 | 502 | 20.00288025s | POST "/api/chat" [GIN] 2026/04/09 - 21:46:14 | 502 | 20.002244458s | POST "/api/chat" [GIN] 2026/04/09 - 21:47:23 | 502 | 20.001394208s | POST "/api/chat" [GIN] 2026/04/09 - 21:48:05 | 502 | 20.002511625s | POST "/api/chat" [GIN] 2026/04/09 - 22:08:35 | 502 | 20.002653625s | POST "/api/generate" [GIN] 2026/04/09 - 22:08:55 | 502 | 20.002629833s | POST "/api/generate" [GIN] 2026/04/09 - 22:09:15 | 502 | 20.002901s | POST "/api/generate" [GIN] 2026/04/09 - 22:09:35 | 502 | 20.00160475s | POST "/api/chat" [GIN] 2026/04/09 - 22:09:55 | 502 | 20.002261209s | POST "/api/chat" [GIN] 2026/04/09 - 22:10:15 | 502 | 20.002600167s | POST "/api/chat" [GIN] 2026/04/09 - 22:10:35 | 502 | 20.001014458s | POST "/api/chat" [GIN] 2026/04/09 - 22:10:44 | 502 | 2m0s | POST "/api/generate" (×9) [GIN] 2026/04/09 - 22:10:55 | 502 | 20.0013255s | POST "/api/chat" [GIN] 2026/04/09 - 22:12:44 | 502 | 2m0s | POST "/api/generate" (×9) [GIN] 2026/04/09 - 22:15:59 | 502 | 3m0s | POST "/api/generate" [GIN] 2026/04/09 - 22:17:05 | 502 | 1m0s | POST "/api/generate" [GIN] 2026/04/09 - 22:17:35 | 502 | 15.002056084s | POST "/api/generate" [GIN] 2026/04/10 - 02:11:42 | 502 | 4.988163084s | POST "/v1/chat/completions" [GIN] 2026/04/10 - 13:13:40 | 502 | 20.003288167s | POST "/api/chat" [GIN] 2026/04/10 - 13:14:10 | 502 | 20.002086333s | POST "/api/chat" [GIN] 2026/04/10 - 13:14:37 | 502 | 20.002178709s | POST "/api/chat" [GIN] 2026/04/10 - 13:17:58 | 502 | 20.001516875s | POST "/api/chat" [GIN] 2026/04/10 - 13:19:36 | 502 | 20.002368333s | POST "/api/chat" [GIN] 2026/04/10 - 13:22:20 | 502 | 20.003383167s | POST "/api/chat" [GIN] 2026/04/10 - 13:24:03 | 502 | 20.003107333s | POST "/api/chat" [GIN] 2026/04/10 - 13:24:23 | 502 | 20.003951959s | POST "/api/chat" ``` </details> ### What the 502 responses look like vs. successful ones Successful requests return rich tracing headers: ``` Server: Google Frontend X-Request-Id: 39a39f87-d82e-45dd-8408-0d95692b876e Traceparent: 00-86d89d6bfb1481d41f5799ef8470d8b6-a91d7a5e4b265d16-00 X-Cloud-Trace-Context: 86d89d6bfb1481d41f5799ef8470d8b6/12186030712140750102 ``` The 502 failures return **empty body, no headers** — there is nothing client-side to provide beyond the timestamps above. --- As of 2026-04-10 ~05:30 UTC, the issue is no longer reproducing after the capacity improvements @jmorganca mentioned. This is consistent with a transient capacity/routing incident on the cloud backend. Could you investigate the 502s using these UTC+8 timestamps against your edge/gateway logs? Happy to provide our account email privately if that helps correlate.
Author
Owner

@matholland618 commented on GitHub (Apr 11, 2026):

same for me....I noticed this yesterday, maybe late wed. evening. I noticed it when I changed my hermes model to glm 5.1 cloud...thought it was an issue with that model, it would just freeze up during tasks, or not respond at all....then I went back to qwen 3.5, and it's doing the same thing..

<!-- gh-comment-id:4228298203 --> @matholland618 commented on GitHub (Apr 11, 2026): same for me....I noticed this yesterday, maybe late wed. evening. I noticed it when I changed my hermes model to glm 5.1 cloud...thought it was an issue with that model, it would just freeze up during tasks, or not respond at all....then I went back to qwen 3.5, and it's doing the same thing..
Author
Owner

@orrinwitt commented on GitHub (Apr 12, 2026):

I just switched my nanobot using glm-5.1 from ollama cloud to openrouter and still got the error. maybe it's the upstream providers?

<!-- gh-comment-id:4232639237 --> @orrinwitt commented on GitHub (Apr 12, 2026): I just switched my nanobot using glm-5.1 from ollama cloud to openrouter and still got the error. maybe it's the upstream providers?
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/ollama#9878