[GH-ISSUE #13899] /v1/chat/completions endpoint 40-50x slower over network #9094

Closed
opened 2026-04-12 21:56:44 -05:00 by GiteaMirror · 5 comments

Originally created by @LyudmilArkov on GitHub (Jan 25, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13899

What is the issue?

/v1/chat/completions endpoint 40-50x slower over network

Summary

The /v1/chat/completions endpoint shows severe performance degradation when accessed over the network (40-50x slower than local access or the native /api/generate), breaking the Claude Code and OpenCode integrations that the official docs describe.

Environment

  • Ollama: 0.14.3
  • Server: Ubuntu 24.04, AMD Ryzen 5 (6c/12t), 32GB RAM, CPU-only
  • Network: LAN (same subnet)

Models Tested

Both models exhibit the same issue:

  • qwen3-coder:30b (Q4_K_M quantization)
  • gpt-oss:20b

Model Configurations Tested

Multiple variants with varying context sizes and thread counts:

  • Context: 32K, 64K, 128K tokens
  • Threads: 6, 12
  • Batch size: 512

All variants show the same remote slowness.

Official Integration Docs Followed

  • https://docs.ollama.com/integrations/claude-code
  • https://docs.ollama.com/integrations/opencode

Both integrations fail due to this performance issue.

Reproduction

Local (works ✅):

# On server
time ollama run gpt-oss:20b "Write hello world in Python"
# ~15-20 seconds

Remote native API (works ✅):

# From remote client
curl http://SERVER:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Write hello world in Python",
  "stream": false
}'
# ~20 seconds
# 72 prompt tokens, 92 completion tokens

Remote OpenAI-compatible API (broken ❌):

# From remote client
curl -X POST http://SERVER:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Write hello world in Python"}],
    "max_tokens": 500,
    "stream": false
  }'
# ~14 minutes
# 72 prompt tokens, 140 completion tokens

Performance Data

Endpoint              Location  Time  Tokens/sec  Status
ollama run            Local     15s   20-30       ✅
/api/generate         Remote    20s   15-20       ✅
/v1/chat/completions  Local     20s   15-20       ✅
/v1/chat/completions  Remote    14m   0.3-0.5     ❌

Same prompt, same model, comparable token counts - only the remote /v1/chat/completions call is 40-50x slower.

Impact

Breaks tools requiring OpenAI-compatible format:

  • Claude Code (official Anthropic integration)
  • OpenCode (official integration)
  • Any OpenAI SDK client accessing Ollama remotely

Tools using native API work fine:

  • Apollo (iOS)
  • Reins (iOS)

Attempted Workarounds

  • LiteLLM proxy: Same issue (uses /v1/chat/completions under the hood)
  • Streaming disabled: "stream": false has no effect
  • Different models/contexts: All affected equally
  • Thread optimization: No improvement for remote calls

Server Logs

[GIN] | 500 | 9m26s | CLIENT_IP | POST "/v1/chat/completions"

No error details, just a long processing time followed by a 500.
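
To capture more detail on that 500, verbose server logging can help (a sketch for a systemd install; OLLAMA_DEBUG is a standard Ollama debug variable):

# Enable debug logging for the ollama service, then follow the log live
sudo systemctl edit ollama     # add under [Service]: Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f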

Network Verification

  • /v1/models endpoint responds instantly
  • Native /api/generate performs normally
  • No firewalls/proxies between client and server
  • OLLAMA_HOST=0.0.0.0:11434

Hypothesis

The /v1/chat/completions endpoint likely has per-token network overhead or inefficient buffering that doesn't affect local loopback connections or the native API.
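
One way to test this would be to timestamp each streamed chunk as it arrives at a remote client: steady gaps would indicate normal token pacing, while long stalls or a single burst at the end would point at buffering. A minimal sketch (SERVER is a placeholder):

# -N disables curl's output buffering; print an arrival time for each SSE line
curl -sN http://SERVER:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-oss:20b","messages":[{"role":"user","content":"Write hello world in Python"}],"stream":true}' \
  | while IFS= read -r line; do printf '%s %.60s\n' "$(date +%T.%N)" "$line"; done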

Relevant log output

No response

OS

Linux

GPU

No response

CPU

AMD

Ollama version

0.14.3

GiteaMirror added the bug label 2026-04-12 21:56:44 -05:00

@rick-github commented on GitHub (Jan 25, 2026):

Server log (https://docs.ollama.com/troubleshooting) may help in debugging.

$ ssh $SERVER ollama -v
ollama version is 0.14.3

$ payload='{"model": "gpt-oss:20b","messages": [{"role": "user", "content": "Write hello world in Python"}],"max_tokens": 500,"stream": false}'

$ time curl -s $SERVER:11434/v1/chat/completions -d "$payload" | jq -c .usage
{"prompt_tokens":72,"completion_tokens":202,"total_tokens":274}

real	0m1.356s
user	0m0.010s
sys	0m0.015s

$ time ssh $SERVER curl -s localhost:11434/v1/chat/completions -d "'$payload'" | jq -c .usage
{"prompt_tokens":72,"completion_tokens":111,"total_tokens":183}

real	0m1.065s
user	0m0.023s
sys	0m0.015s

@LyudmilArkov commented on GitHub (Jan 25, 2026):

Update: Detailed testing reveals multiple distinct issues.

Test Setup

All tests switched to the base gpt-oss:20b model, running on:

  • Server: Ollama 0.14.3, Ubuntu 24.04, AMD Ryzen 5 (6c/12t), 32GB RAM, CPU-only
  • Default model config: 4096 context, 6 threads

Test 1: Direct API Calls (Works as expected)

Commands:

payload='{"model": "gpt-oss:20b","messages": [{"role": "user", "content": "Write hello world in Python"}],"max_tokens": 500,"stream": false}'

# Remote call
time curl -s SERVER:11434/v1/chat/completions -d "$payload" | jq -c .usage

# Local call via SSH
time ssh SERVER curl -s localhost:11434/v1/chat/completions -d "'$payload'" | jq -c .usage

Results:

=== Remote call ===
{"prompt_tokens":72,"completion_tokens":135,"total_tokens":207}
real: 22.023s

=== Local call (via SSH) ===
{"prompt_tokens":72,"completion_tokens":160,"total_tokens":232}
real: 22.377s

Server logs:

[GIN] | 200 | 21.990674083s | CLIENT | POST "/v1/chat/completions"
[GIN] | 200 | 18.963631881s | ::1    | POST "/v1/chat/completions"

✅ Conclusion: the /v1/chat/completions endpoint works fine. Remote and local performance are identical (~22s).


Test 2: Claude Code (Anthropic API Issues)

Command: Same prompt "Write hello world in Python" via Claude Code TUI

Result: 3m 46s

Server logs:

14:09:33 | 404 |     7.444µs | CLIENT | POST "/v1/messages/count_tokens?beta=true"
14:09:33 | 404 | 4.56157ms   | CLIENT | POST "/v1/messages?beta=true"
14:09:37 | 404 |   769.095µs | CLIENT | POST "/v1/messages?beta=true"
14:09:37 | 404 |   858.846µs | CLIENT | POST "/v1/messages?beta=true"
14:09:37 WARN: truncating input prompt limit=4096 prompt=11090 keep=4 new=4096
14:13:23 | 200 | 3m45s       | CLIENT | POST "/v1/messages?beta=true"
14:13:23 WARN: truncating input prompt limit=4096 prompt=13338 keep=4 new=4096
14:13:37 | 500 | 14.374920968s | CLIENT | POST "/v1/messages?beta=true"

Issues observed:

  1. Multiple 404 errors on /v1/messages endpoint
  2. After retries, eventually returns 200 after 3m45s
  3. Follow-up request fails with 500 error
  4. Sends 11,090-13,338 token prompts for simple "hello world" task
  5. Base model's 4096 context truncates the prompt

Test 3: OpenCode (Slow but Works)

Command: Same prompt "Write hello world in Python" via OpenCode TUI

Result: 6m 46s

Server logs:

14:13:53 WARN: truncating input prompt limit=4096 prompt=6926 keep=4 new=4096
14:20:39 | 200 | 6m46s | CLIENT | POST "/v1/chat/completions"

Issues observed:

  1. Uses /v1/chat/completions (the working endpoint)
  2. Takes 6m46s vs 22s for direct curl (18x slower)
  3. Sends 6,926 token prompt for simple task
  4. Context truncation occurs

Performance Summary

Method         Endpoint              Time    Prompt Size     Status
curl (remote)  /v1/chat/completions  22s     72 tokens       ✅ Ok
curl (local)   /v1/chat/completions  22s     72 tokens       ✅ Ok
Claude Code    /v1/messages          3m 46s  11K-13K tokens  ⚠️ Flaky (404s, 500s)
OpenCode       /v1/chat/completions  6m 46s  6.9K tokens     ⚠️ Works but slow

Issues Identified

1. /v1/messages Endpoint is Unreliable

The Anthropic Messages API implementation returns frequent 404 and 500 errors, requiring multiple retries.

2. Tools Send Massive Prompts

Both Claude Code and OpenCode load extensive context (6K-13K tokens) even for trivial tasks, vs 72 tokens for direct API calls.

3. Base Model Context Insufficient

Default 4096 context causes prompt truncation, requiring custom model variants with 32K+ context for these tools.
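
For reference, a larger-context variant can be built from a Modelfile; a minimal sketch (the variant name and context size here are illustrative):

# Create a 32K-context variant of the base model
cat > Modelfile <<'EOF'
FROM gpt-oss:20b
PARAMETER num_ctx 32768
EOF
ollama create gpt-oss:20b-c32k -f Modelfile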

4. Tool Overhead

Even when working, OpenCode is 18x slower than direct API calls (6m46s vs 22s), suggesting significant processing overhead in the tool itself.


Conclusions

The original "40-50x slower over network" diagnosis was partially incorrect.

The actual issues are:

  1. ✅ /v1/chat/completions works perfectly (identical performance remote vs local)
  2. ❌ /v1/messages is unreliable (frequent 404/500 errors, per logs above)
  3. ⚠️ Tool integrations send massive context and have significant overhead
  4. ⚠️ Base models need larger context (32K+) for these tools - this is expected, I guess

The /v1/chat/completions endpoint may have no network performance bug. The slowness may be primarily due to:

  • /v1/messages endpoint reliability issues
  • Excessive context loading by the integration tools
  • Tool processing overhead

Actual log


Jan 25 14:08:57 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:08:57 | 200 | 21.990674083s |     SERVER-IP | POST     "/v1/chat/completions"
Jan 25 14:09:20 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:20 | 200 | 18.963631881s |             ::1 | POST     "/v1/chat/completions"
Jan 25 14:09:33 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:33 | 404 |       7.444µs |     SERVER-IP | POST     "/v1/messages/count_tokens?beta=true"
Jan 25 14:09:33 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:33 | 404 |     4.56157ms |     SERVER-IP | POST     "/v1/messages?beta=true"
Jan 25 14:09:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:37 | 404 |     858.846µs |     SERVER-IP | POST     "/v1/messages?beta=true"
Jan 25 14:09:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:37 | 404 |     769.095µs |     SERVER-IP | POST     "/v1/messages?beta=true"
Jan 25 14:09:37 SERVER ollama[80186]: time=2026-01-25T14:09:37.587Z level=WARN source=runner.go:186 msg="truncating input prompt" limit=4096 prompt=11090 keep=4 new=4096
Jan 25 14:09:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:37 | 404 |       4.489µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:09:38 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:38 | 404 |       4.529µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:09:40 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:40 | 404 |       4.799µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:09:44 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:44 | 404 |       4.469µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:09:53 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:09:53 | 404 |       4.389µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:10:05 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:10:05 | 404 |      28.024µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:10:23 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:10:23 | 404 |       4.248µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:10:48 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:10:48 | 404 |       4.389µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:11:18 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:11:18 | 404 |        5.19µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:11:48 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:11:48 | 404 |       4.729µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:12:18 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:12:18 | 404 |       6.613µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:12:48 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:12:48 | 404 |       5.079µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:13:18 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:18 | 404 |       6.021µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:13:23 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:23 | 200 |         3m45s |     SERVER-IP | POST     "/v1/messages?beta=true"
Jan 25 14:13:23 SERVER ollama[80186]: time=2026-01-25T14:13:23.592Z level=WARN source=runner.go:186 msg="truncating input prompt" limit=4096 prompt=13338 keep=4 new=4096
Jan 25 14:13:28 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:28 | 404 |        5.42µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:13:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:37 | 404 |         5.2µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:13:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:37 | 404 |       4.599µs |     SERVER-IP | POST     "/api/event_logging/batch"
Jan 25 14:13:37 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:13:37 | 500 | 14.374920968s |     SERVER-IP | POST     "/v1/messages?beta=true"
Jan 25 14:13:53 SERVER ollama[80186]: time=2026-01-25T14:13:53.317Z level=WARN source=runner.go:186 msg="truncating input prompt" limit=4096 prompt=6926 keep=4 new=4096
Jan 25 14:20:39 SERVER ollama[80186]: [GIN] 2026/01/25 - 14:20:39 | 200 |         6m46s |     SERVER-IP | POST     "/v1/chat/completions"


@rick-github commented on GitHub (Jan 25, 2026):

Ollama doesn't support Claude telemetry; set the following environment variables to prevent the 404s:

DISABLE_TELEMETRY=1
DISABLE_ERROR_REPORTING=1
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
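
For example, exported before launching the CLI (assuming Claude Code is started with the claude command):

# Suppress the telemetry endpoints that 404 against Ollama, then start Claude Code
export DISABLE_TELEMETRY=1
export DISABLE_ERROR_REPORTING=1
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
claude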

@LyudmilArkov commented on GitHub (Jan 25, 2026):

Thanks for the telemetry tip @rick-github - I've set those environment variables to clean up the logs.

However, the telemetry 404s are not the core issue. The performance gap remains unexplained:

Direct API Performance (Working Perfectly)

Remote curl:

time curl -s SERVER:11434/v1/chat/completions -d '{"model":"gpt-oss:20b",...}'
# Result: 22 seconds

Server log:

[GIN] | 200 | 21.990674083s | CLIENT | POST "/v1/chat/completions"

Tool Performance (10-18x Slower)

Claude Code: 3m 46s (10.2x slower)

14:13:23 | 200 | 3m45s | CLIENT | POST "/v1/messages?beta=true"

OpenCode: 6m 46s (18.4x slower)

14:20:39 | 200 | 6m46s | CLIENT | POST "/v1/chat/completions"

All three tests:

  • Same simple prompt: "Write hello world in Python"
  • Same model: gpt-oss:20b
  • Same server
  • Run consecutively

The Actual Issue

Why do Claude Code and OpenCode take 10-18x longer than direct API calls for identical prompts?

Server logs show the tools send massive prompts (6K-13K tokens vs 72 tokens for curl), but even accounting for that, there's unexplained overhead.

Is this expected behavior? Should tools using Ollama's API be 10-18x slower than direct curl calls?


@rick-github commented on GitHub (Jan 25, 2026):

It takes longer to process the opencode prompt because the prompt is 9336 tokens, not 72 tokens. From the logs, the context size hasn't been increased, so the prompt is truncated and the effective length is 4096 tokens.

Compare the time taken to process a short prompt ($RANDOM to invalidate prompt caching, max_tokens to remove eval processing time):

$ echo '{"model":"gpt-oss:20b","messages":[{"role":"user","content":'"$(yes $[RANDOM%10] Write hello world in Python. | head -1 | jq -sR)"'}],"stream":false,"max_tokens":5}' | /usr/bin/time -f '%e' curl -s localhost:11434/v1/chat/completions -d@- | jq -c .usage
{"prompt_tokens":74,"completion_tokens":5,"total_tokens":79}
0.55

A 9k prompt with prompt truncation:

$ echo '{"model":"gpt-oss:20b","messages":[{"role":"user","content":'"$(yes $[RANDOM%10] Write hello world in Python. | head -1324 | jq -sR)"'}],"stream":false,"max_tokens":5}' | /usr/bin/time -f '%e' curl -s localhost:11434/v1/chat/completions -d@- | jq -c .usage
{"prompt_tokens":4096,"completion_tokens":5,"total_tokens":4101}
28.79

A 9k prompt without prompt truncation:

$ echo '{"model":"gpt-oss:20b-c128k","messages":[{"role":"user","content":'"$(yes $[RANDOM%10] Write hello world in Python. | head -1324 | jq -sR)"'}],"stream":false,"max_tokens":5}' | /usr/bin/time -f '%e' curl -s localhost:11434/v1/chat/completions -d@- | jq -c .usage
{"prompt_tokens":9335,"completion_tokens":5,"total_tokens":9340}
68.46

opencode with prompt truncation, including eval time:

$ ollama run gpt-oss:20b $RANDOM hello >/dev/null ; /usr/bin/time -f '%e' opencode run "Write hello world in Python" --model ollama/gpt-oss:20b >/dev/null
33.70

opencode without prompt truncation, including eval time:

$ ollama run gpt-oss:20b-c128k $RANDOM hello >/dev/null ; /usr/bin/time -f '%e' opencode run "Write hello world in Python" --model ollama/gpt-oss:20b-c128k >/dev/null
76.08

More prompt, more processing time.
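
The scaling is easy to check with a sweep over prompt sizes, in the same style as above (a sketch; assumes jq and the gpt-oss:20b-c128k variant, line counts illustrative):

# Time prompt processing at several prompt lengths; $RANDOM defeats prompt caching
for n in 1 128 512 1324; do
  body='{"model":"gpt-oss:20b-c128k","messages":[{"role":"user","content":'"$(yes "$RANDOM pad" | head -$n | jq -sR)"'}],"stream":false,"max_tokens":5}'
  /usr/bin/time -f "$n lines: %es" curl -s localhost:11434/v1/chat/completions -d "$body" | jq -c .usage
done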
