[GH-ISSUE #13899] /v1/chat/completions endpoint 40-50x slower over network #9094

Closed
opened 2026-04-12 21:56:44 -05:00 by GiteaMirror · 5 comments

Originally created by @LyudmilArkov on GitHub (Jan 25, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13899

What is the issue?

/v1/chat/completions endpoint 40-50x slower over network

Summary

The /v1/chat/completions endpoint shows severe performance degradation when accessed over the network (40-50x slower than local access or the native /api/generate), breaking the Claude Code and OpenCode integrations that the official docs describe.

Environment

  • Ollama: 0.14.3
  • Server: Ubuntu 24.04, AMD Ryzen 5 (6c/12t), 32GB RAM, CPU-only
  • Network: LAN (same subnet)

Models Tested

Both models exhibit the same issue:

  • qwen3-coder:30b (Q4_K_M quantization)
  • gpt-oss:20b

Model Configurations Tested

Multiple variants with varying context sizes and thread counts:

  • Context: 32K, 64K, 128K tokens
  • Threads: 6, 12
  • Batch size: 512

All variants show the same remote slowness.

Official Integration Docs Followed

  • https://docs.ollama.com/integrations/claude-code
  • https://docs.ollama.com/integrations/opencode

Both integrations fail due to this performance issue.

Reproduction

Local (works ✅):

# On server
time ollama run gpt-oss:20b "Write hello world in Python"
# ~15-20 seconds

Remote native API (works ✅):

# From remote client
curl http://SERVER:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Write hello world in Python",
  "stream": false
}'
# ~20 seconds
# 72 prompt tokens, 92 completion tokens

Remote OpenAI-compatible API (broken ❌):

# From remote client
curl -X POST http://SERVER:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Write hello world in Python"}],
    "max_tokens": 500,
    "stream": false
  }'
# ~14 minutes
# 72 prompt tokens, 140 completion tokens

Performance Data

Endpoint              Location  Time  Tokens/sec  Status
ollama run            Local     15s   20-30       ✅
/api/generate         Remote    20s   15-20       ✅
/v1/chat/completions  Local     20s   15-20       ✅
/v1/chat/completions  Remote    14m   0.3-0.5     ❌

Same prompt, same model, comparable token counts - only the remote /v1/chat/completions call is 40-50x slower.

Impact

Breaks tools requiring OpenAI-compatible format:

  • Claude Code (official Anthropic integration)
  • OpenCode (official integration)
  • Any OpenAI SDK client accessing Ollama remotely

Tools using native API work fine:

  • Apollo (iOS)
  • Reins (iOS)

Attempted Workarounds

  • LiteLLM proxy: Same issue (uses /v1/chat/completions under the hood)
  • Streaming disabled: "stream": false has no effect
  • Different models/contexts: All affected equally
  • Thread optimization: No improvement for remote calls

Server Logs

[GIN] | 500 | 9m26s | CLIENT_IP | POST "/v1/chat/completions"

No error details, just a long processing time followed by a 500.
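
To capture more detail on that 500, verbose server logging can help (a sketch for a systemd install; OLLAMA_DEBUG is a standard Ollama debug variable):

# Enable debug logging for the ollama service, then follow the log live
sudo systemctl edit ollama     # add under [Service]: Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f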

Network Verification

  • /v1/models endpoint responds instantly
  • Native /api/generate performs normally
  • No firewalls/proxies between client and server
  • OLLAMA_HOST=0.0.0.0:11434

Hypothesis

The /v1/chat/completions endpoint likely has per-token network overhead or inefficient buffering that doesn't affect local loopback connections or the native API.
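
One way to test this would be to timestamp each streamed chunk as it arrives at a remote client: steady gaps would indicate normal token pacing, while long stalls or a single burst at the end would point at buffering. A minimal sketch (SERVER is a placeholder):

# -N disables curl's output buffering; print an arrival time for each SSE line
curl -sN http://SERVER:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-oss:20b","messages":[{"role":"user","content":"Write hello world in Python"}],"stream":true}' \
  | while IFS= read -r line; do printf '%s %.60s\n' "$(date +%T.%N)" "$line"; done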

Relevant log output

No response

OS

Linux

GPU

No response

CPU

AMD

Ollama version

0.14.3

GiteaMirror added the bug label 2026-04-12 21:56:44 -05:00

@rick-github commented on GitHub (Jan 25, 2026):

Server log (https://docs.ollama.com/troubleshooting) may help in debugging.

$ ssh $SERVER ollama -v
ollama version is 0.14.3

$ payload='{"model": "gpt-oss:20b","messages": [{"role": "user", "content": "Write hello world in Python"}],"max_tokens": 500,"stream": false}'

$ time curl -s $SERVER:11434/v1/chat/completions -d "$payload" | jq -c .usage
{"prompt_tokens":72,"completion_tokens":202,"total_tokens":274}

real	0m1.356s
user	0m0.010s
sys	0m0.015s

$ time ssh $SERVER curl -s localhost:11434/v1/chat/completions -d "'$payload'" | jq -c .usage
{"prompt_tokens":72,"completion_tokens":111,"total_tokens":183}

real	0m1.065s
user	0m0.023s
sys	0m0.015s

@LyudmilArkov commented on GitHub (Jan 25, 2026):

Update: Detailed testing reveals multiple distinct issues.

Test Setup

All tests switched to the base gpt-oss:20b model, running on:

  • Server: Ollama 0.14.3, Ubuntu 24.04, AMD Ryzen 5 (6c/12t), 32GB RAM, CPU-only
  • Default model config: 4096 context, 6 threads

Test 1: Direct API Calls (Works as expected)

Commands:

payload='{"model": "gpt-oss:20b","messages": [{"role": "user", "content": "Write hello world in Python"}],"max_tokens": 500,"stream": false}'

# Remote call
time curl -s SERVER:11434/v1/chat/completions -d "$payload" | jq -c .usage

# Local call via SSH
time ssh SERVER curl -s localhost:11434/v1/chat/completions -d "'$payload'" | jq -c .usage

Results:

=== Remote call ===
{"prompt_tokens":72,"completion_tokens":135,"total_tokens":207}
real: 22.023s

=== Local call (via SSH) ===
{"prompt_tokens":72,"completion_tokens":160,"total_tokens":232}
real: 22.377s

Server logs:

[GIN] | 200 | 21.990674083s | CLIENT | POST "/v1/chat/completions"
[GIN] | 200 | 18.963631881s | ::1    | POST "/v1/chat/completions"

✅ Conclusion: the /v1/chat/completions endpoint works fine. Remote and local performance are identical (~22s).


Test 2: Claude Code (Anthropic API Issues)

Command: Same prompt "Write hello world in Python" via Claude Code TUI

Result: 3m 46s

Server logs:

14:09:33 | 404 |     7.444µs | CLIENT | POST "/v1/messages/count_tokens?beta=true"
14:09:33 | 404 | 4.56157ms   | CLIENT | POST "/v1/messages?beta=true"
14:09:37 | 404 |   769.095µs | CLIENT | POST "/v1/messages?beta=true"
14:09:37 | 404 |   858.846µs | CLIENT | POST "/v1/messages?beta=true"
14:09:37 WARN: truncating input prompt limit=4096 prompt=11090 keep=4 new=4096
14:13:23 | 200 | 3m45s       | CLIENT | POST "/v1/messages?beta=true"
14:13:23 WARN: truncating input prompt limit=4096 prompt=13338 keep=4 new=4096
14:13:37 | 500 | 14.374920968s | CLIENT | POST "/v1/messages?beta=true"

Issues observed:

  1. Multiple 404 errors on /v1/messages endpoint
  2. After retries, eventually returns 200 after 3m45s
  3. Follow-up request fails with 500 error
  4. Sends 11,090-13,338 token prompts for simple "hello world" task
  5. Base model's 4096 context truncates the prompt

Test 3: OpenCode (Slow but Works)

Command: Same prompt "Write hello world in Python" via OpenCode TUI

Result: 6m 46s

Server logs:

14:13:53 WARN: truncating input prompt limit=4096 prompt=6926 keep=4 new=4096
14:20:39 | 200 | 6m46s | CLIENT | POST "/v1/chat/completions"

Issues observed:

  1. Uses /v1/chat/completions (the working endpoint)
  2. Takes 6m46s vs 22s for direct curl (18x slower)
  3. Sends 6,926 token prompt for simple task
  4. Context truncation occurs

Performance Summary

Method         Endpoint              Time    Prompt Size     Status
curl (remote)  /v1/chat/completions  22s     72 tokens       ✅ Ok
curl (local)   /v1/chat/completions  22s     72 tokens       ✅ Ok
Claude Code    /v1/messages          3m 46s  11K-13K tokens  ⚠️ Flaky (404s, 500s)
OpenCode       /v1/chat/completions  6m 46s  6.9K tokens     ⚠️ Works but slow

Issues Identified

1. /v1/messages Endpoint is Unreliable

The Anthropic Messages API implementation returns frequent 404 and 500 errors, requiring multiple retries.

2. Tools Send Massive Prompts

Both Claude Code and OpenCode load extensive context (6K-13K tokens) even for trivial tasks, vs 72 tokens for direct API calls.

3. Base Model Context Insufficient

Default 4096 context causes prompt truncation, requiring custom model variants with 32K+ context for these tools.
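
For reference, a larger-context variant can be built from a Modelfile; a minimal sketch (the variant name and context size here are illustrative):

# Create a 32K-context variant of the base model
cat > Modelfile <<'EOF'
FROM gpt-oss:20b
PARAMETER num_ctx 32768
EOF
ollama create gpt-oss:20b-c32k -f Modelfile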

4. Tool Overhead

Even when working, OpenCode is 18x slower than direct API calls (6m46s vs 22s), suggesting significant processing overhead in the tool itself.


Conclusions

The original "40-50x slower over network" diagnosis was partially incorrect.

The actual issues are:

  1. ✅ /v1/chat/completions works perfectly (identical performance remote vs local)
  2. ❌ /v1/messages is unreliable (frequent 404/500 errors, per logs above)
  3. ⚠️ Tool integrations send massive context and have significant overhead
  4. ⚠️ Base models need larger context (32K+) for these tools - this is expected, I guess

The /v1/chat/completions endpoint may have no network performance bug. The slowness may be primarily due to:

  • /v1/messages endpoint reliability issues
  • Excessive context loading by the integration tools
  • Tool processing overhead

Actual log


Jan 25 14:08:57 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:08:57 | 200 | 21.990674083s |     SERVER-IP | POST     "/v1/chat/completions"
Jan 25 14:09:20 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:20 | 200 | 18.963631881s |             ::1 | POST     "/v1/chat/completions"
Jan 25 14:09:33 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:33 | 404 |       7.444µs |     SERVER-IP | POST     "/v1/messages/count_tokens?beta=true"
Jan 25 14:09:33 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:33 | 404 |     4.56157ms |     SERVER-IP | POST     "/v1/messages?beta=true"
Jan 25 14:09:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:37 | 404 |     858.846µs |     SERVER-IP | POST     "/v1/messages?beta=true"
Jan 25 14:09:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:37 | 404 |     769.095µs |     SERVER-IP | POST     "/v1/messages?beta=true"
Jan 25 14:09:37 SERVER ollama[80186]: time=2026-01-25T14:09:37.587Z level=WARN source=runner.go:186 msg="truncating input prompt" limit=4096 prompt=11090 keep=4 new=4096
Jan 25 14:09:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:37 | 404 |       4.489µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:09:38 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:38 | 404 |       4.529µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:09:40 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:40 | 404 |       4.799µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:09:44 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:44 | 404 |       4.469µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:09:53 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:53 | 404 |       4.389µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:10:05 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:10:05 | 404 |      28.024µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:10:23 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:10:23 | 404 |       4.248µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:10:48 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:10:48 | 404 |       4.389µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:11:18 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:11:18 | 404 |        5.19µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:11:48 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:11:48 | 404 |       4.729µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:12:18 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:12:18 | 404 |       6.613µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:12:48 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:12:48 | 404 |       5.079µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:13:18 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:18 | 404 |       6.021µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:13:23 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:23 | 200 |         3m45s |     SERVER-IP | POST     "/v1/messages?beta=true"
Jan 25 14:13:23 SERVER ollama[80186]: time=2026-01-25T14:13:23.592Z level=WARN source=runner.go:186 msg="truncating input prompt" limit=4096 prompt=13338 keep=4 new=4096
Jan 25 14:13:28 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:28 | 404 |        5.42µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:13:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:37 | 404 |         5.2µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:13:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:37 | 404 |       4.599µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:13:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:37 | 500 | 14.374920968s |     SERVER-IP | POST     "/v1/messages?beta=true"
Jan 25 14:13:53 SERVER ollama[80186]: time=2026-01-25T14:13:53.317Z level=WARN source=runner.go:186 msg="truncating input prompt" limit=4096 prompt=6926 keep=4 new=4096
Jan 25 14:20:39 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:20:39 | 200 |         6m46s |     SERVER-IP | POST     "/v1/chat/completions"


@rick-github commented on GitHub (Jan 25, 2026):

Ollama doesn't support Claude telemetry; set the following environment variables to prevent the 404s:

DISABLE_TELEMETRY=1
DISABLE_ERROR_REPORTING=1
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
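
For example, exported before launching the CLI (assuming Claude Code is started with the claude command):

# Suppress the telemetry endpoints that 404 against Ollama, then start Claude Code
export DISABLE_TELEMETRY=1
export DISABLE_ERROR_REPORTING=1
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
claude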

@LyudmilArkov commented on GitHub (Jan 25, 2026):

Thanks for the telemetry tip @rick-github - I've set those environment variables to clean up the logs.

However, the telemetry 404s are not the core issue. The performance gap remains unexplained:

Direct API Performance (Working Perfectly)

Remote curl:

time curl -s SERVER:11434/v1/chat/completions -d '{"model":"gpt-oss:20b",...}'
# Result: 22 seconds

Server log:

[GIN] | 200 | 21.990674083s | CLIENT | POST "/v1/chat/completions"

Tool Performance (10-18x Slower)

Claude Code: 3m 46s (10.2x slower)

14:13:23 | 200 | 3m45s | CLIENT | POST "/v1/messages?beta=true"

OpenCode: 6m 46s (18.4x slower)

14:20:39 | 200 | 6m46s | CLIENT | POST "/v1/chat/completions"

All three tests:

  • Same simple prompt: "Write hello world in Python"
  • Same model: gpt-oss:20b
  • Same server
  • Run consecutively

The Actual Issue

Why do Claude Code and OpenCode take 10-18x longer than direct API calls for identical prompts?

Server logs show the tools send massive prompts (6K-13K tokens vs 72 tokens for curl), but even accounting for that, there's unexplained overhead.

Is this expected behavior? Should tools using Ollama's API be 10-18x slower than direct curl calls?


@rick-github commented on GitHub (Jan 25, 2026):

It takes longer to process the opencode prompt because the prompt is 9336 tokens, not 72 tokens. From the logs, the context size hasn't been increased, so the prompt is truncated and the effective length is 4096 tokens.

Compare the time taken to process a short prompt ($RANDOM to invalidate prompt caching, max_tokens to remove eval processing time):

$ echo '{"model":"gpt-oss:20b","messages":[{"role":"user","content":'"$(yes $[RANDOM%10] Write hello world in Python. | head -1 | jq -sR)"'}],"stream":false,"max_tokens":5}' | /usr/bin/time -f '%e' curl -s localhost:11434/v1/chat/completions -d@- | jq -c .usage
{"prompt_tokens":74,"completion_tokens":5,"total_tokens":79}
0.55

A 9k prompt with prompt truncation:

$ echo '{"model":"gpt-oss:20b","messages":[{"role":"user","content":'"$(yes $[RANDOM%10] Write hello world in Python. | head -1324 | jq -sR)"'}],"stream":false,"max_tokens":5}' | /usr/bin/time -f '%e' curl -s localhost:11434/v1/chat/completions -d@- | jq -c .usage
{"prompt_tokens":4096,"completion_tokens":5,"total_tokens":4101}
28.79

A 9k prompt without prompt truncation:

$ echo '{"model":"gpt-oss:20b-c128k","messages":[{"role":"user","content":'"$(yes $[RANDOM%10] Write hello world in Python. | head -1324 | jq -sR)"'}],"stream":false,"max_tokens":5}' | /usr/bin/time -f '%e' curl -s localhost:11434/v1/chat/completions -d@- | jq -c .usage
{"prompt_tokens":9335,"completion_tokens":5,"total_tokens":9340}
68.46

opencode with prompt truncation, including eval time:

$ ollama run gpt-oss:20b $RANDOM hello >/dev/null ; /usr/bin/time -f '%e' opencode run "Write hello world in Python" --model ollama/gpt-oss:20b >/dev/null
33.70

opencode without prompt truncation, including eval time:

$ ollama run gpt-oss:20b-c128k $RANDOM hello >/dev/null ; /usr/bin/time -f '%e' opencode run "Write hello world in Python" --model ollama/gpt-oss:20b-c128k >/dev/null
76.08

More prompt, more processing time.
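
The scaling is easy to check with a sweep over prompt sizes, in the same style as above (a sketch; assumes jq and the gpt-oss:20b-c128k variant, line counts illustrative):

# Time prompt processing at several prompt lengths; $RANDOM defeats prompt caching
for n in 1 128 512 1324; do
  body='{"model":"gpt-oss:20b-c128k","messages":[{"role":"user","content":'"$(yes "$RANDOM pad" | head -$n | jq -sR)"'}],"stream":false,"max_tokens":5}'
  /usr/bin/time -f "$n lines: %es" curl -s localhost:11434/v1/chat/completions -d "$body" | jq -c .usage
done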
