[GH-ISSUE #3844] API error occurs after a number of requests #28141

Closed
opened 2026-04-22 05:58:44 -05:00 by GiteaMirror · 7 comments
Owner

Originally created by @Shiyaoa on GitHub (Apr 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3844

What is the issue?

I try to POST requests to the URL http://localhost:11434/v1 with the model "llama3:8b-instruct-q8_0". The very first request succeeds, but subsequent requests fail with:
Error occurred: Error code: 400 - {'error': {'message': 'unexpected server status: 1', 'type': 'api_error', 'param': None, 'code': None}}

Then I tried the model "wizardlm2:7b-q8_0"; the same error occurred after 2418 requests.
28%|██▊ | 2418/8569 [4:53:46<12:27:18, 7.29s/it]
Error occurred: Error code: 400 - {'error': {'message': 'unexpected server status: 1', 'type': 'api_error', 'param': None, 'code': None}}

I have checked the logs, but I can't solve it:
[GIN] 2024/04/23 - 04:33:07 | 400 | 56.4274ms | 127.0.0.1 | POST "/v1/chat/completions"
time=2024-04-23T04:33:07.535+08:00 level=ERROR source=prompt.go:86 msg="failed to encode prompt" err="unexpected server status: 1"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.1.32

GiteaMirror added the bug label 2026-04-22 05:58:44 -05:00

@seanmavley commented on GitHub (Apr 23, 2024):

I'm using phi3, and just got the error

```
Apr 23 22:29:04 KhoPhi ollama[198220]: {"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":35,"slot_id":0,"task_id":54,"tid":"1406203>
Apr 23 22:29:06 KhoPhi ollama[264]: [GIN] 2024/04/23 - 22:29:06 | 200 |  2.203791417s |       127.0.0.1 | POST     "/api/chat"
Apr 23 22:29:06 KhoPhi ollama[198220]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":0,"n_processing_slots":1,"task_id":94>
Apr 23 22:29:06 KhoPhi ollama[198220]: {"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/completion","remot>
Apr 23 22:29:06 KhoPhi ollama[198220]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_add>
Apr 23 22:29:06 KhoPhi ollama[198220]: {"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":0,"n_processing_slots":1,"task_id":97>
Apr 23 22:29:06 KhoPhi ollama[198220]: {"function":"update_slots","level":"INFO","line":1640,"msg":"slot released","n_cache_tokens":675,"n_ctx":2048,"n_past":674,"n_syste>
Apr 23 22:29:06 KhoPhi ollama[198220]: {"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_add>
Apr 23 22:29:06 KhoPhi ollama[264]: time=2024-04-23T22:29:06.446Z level=ERROR source=prompt.go:86 msg="failed to encode prompt" err="unexpected server status: 1"
Apr 23 22:29:06 KhoPhi ollama[264]: [GIN] 2024/04/23 - 22:29:06 | 400 |   92.310101ms |       127.0.0.1 | POST     "/api/chat"
```

Specs

WSL2 Ubuntu 22.04 LTS
Nvidia 2060 rtx
i7
0.1.32 Ollama


@baptistejamin commented on GitHub (Apr 24, 2024):

I get the error with Phi3 on one Nvidia GPU, but not on another with exactly the same parameters.


@sooraj12 commented on GitHub (Apr 24, 2024):

This has something to do with setting format="json" on the LLM model:

```python
ChatOllama(
    base_url=base_url, model=grader_llm_name, format="json", temperature=0
)
```

This works fine if the format is not specified. I am getting this error when setting format="json": usually the first call is fine, but an immediate sequential call causes it.

#3154
#3860

ollama version: 0.1.32
os: ubuntu 18.04
Nvidia GPU
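The failure mode described above (first call with format="json" succeeds, the next one 400s) can be worked around client-side by retrying once without the format field. A minimal sketch, assuming requests are funneled through a caller-supplied `post` callable (a hypothetical helper, not part of any Ollama client library):

```python
def chat_with_json_fallback(post, payload):
    """Send a chat payload; if a request with format="json" fails with the
    'unexpected server status' 400 seen in this issue, retry once with the
    format field removed.

    `post` is any callable taking a payload dict and returning a
    (status_code, body_dict) pair, e.g. a thin wrapper around requests.post.
    """
    status, body = post(payload)
    if (status == 400
            and payload.get("format") == "json"
            and "unexpected server status" in str(body.get("error", ""))):
        # Drop format="json" and try again, per the workaround in this thread.
        fallback = {k: v for k, v in payload.items() if k != "format"}
        status, body = post(fallback)
    return status, body
```

This trades guaranteed-JSON output for availability; pair it with prompt-level JSON instructions if structured output is still required.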


@navrig commented on GitHub (Apr 24, 2024):

I'm also experiencing this error consistently when using format="json".

If a long enough delay is introduced after each call to the LLM before making the next call to Ollama, the problem doesn't show up.

So it appears that the format option introduces a significant gap between when the server finishes its response and when everything is ready to receive another request.

Ollama: 0.1.32
OS: Ubuntu 22.04.4 LTS
2 x Nvidia GPU

I implemented a retry mechanism; a typical run is shown below. It's not always the same, but it's consistently reproducible.

---ROUTE QUESTION---
What are the types of agent memory?
{'datasource': 'vectorstore'}
vectorstore
---ROUTE QUESTION TO RAG---
---RETRIEVE---
'Finished running: retrieve:'
---CHECK DOCUMENT RELEVANCE TO QUESTION---
---GRADE: DOCUMENT RELEVANT---
Attempt 1 failed with error: Ollama call failed with status code 400. Details: {"error":"unexpected server status: 1"}
Retrying...
---GRADE: DOCUMENT RELEVANT---
---GRADE: DOCUMENT RELEVANT---
Attempt 1 failed with error: Ollama call failed with status code 400. Details: {"error":"unexpected server status: 1"}
Retrying...
---GRADE: DOCUMENT RELEVANT---
---ASSESS GRADED DOCUMENTS---
---DECISION: GENERATE---
'Finished running: grade_documents:'
---GENERATE---
Attempt 1 failed with error: Ollama call failed with status code 400. Details: {"error":"unexpected server status: 1"}
Retrying...
---CHECK HALLUCINATIONS---
---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---
---GRADE GENERATION vs QUESTION---
Attempt 1 failed with error: Ollama call failed with status code 400. Details: {"error":"unexpected server status: 1"}
Retrying...
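A retry loop producing output like the log above can be sketched as follows. The wrapped call, attempt count, and delay are assumptions, since the original code isn't shown; in practice you would catch the client library's specific HTTP error rather than bare `Exception`:

```python
import time


def call_with_retry(fn, attempts=3, delay=1.0):
    """Call fn(); on failure, print a message, wait `delay` seconds, and
    retry, mirroring the 'Attempt N failed ... Retrying...' lines above."""
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # narrow this to the client's error type
            last_exc = exc
            print(f"Attempt {attempt} failed with error: {exc}")
            print("Retrying...")
            time.sleep(delay)
    raise last_exc
```

Since the comment above notes that a delay alone makes the problem disappear, even a single retry with a short sleep tends to succeed.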


@Shiyaoa commented on GitHub (Apr 25, 2024):

Thank you @sooraj12 , @navrig for sharing and suggesting.
I have tried using the OpenAI-style API and found that it was indeed an issue with the JSON format response. When I commented out the parameter response_format={"type": "json_object"}, the Ollama API started working properly.

```python
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required, but unused
)
completion = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    # response_format={"type": "json_object"},
    messages=messages,
    temperature=0,
    top_p=1,
)
```

I'm not quite clear on how Ollama handles JSON mode, but I noticed that with JSON mode enabled, the API's processing and response time becomes very long, about 10 times that of non-JSON mode. This observation is similar to #3851.
In non-JSON mode, my GPU utilization reaches 80% to 100%, whereas in JSON mode it is only about 15% (my GPU is an RTX 4080 Laptop). I hope the Ollama development team can optimize this.
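If structured output is still needed after dropping response_format, a common client-side workaround (an assumption here, not something this thread or Ollama prescribes) is to ask for JSON in the prompt and extract the first JSON object from the reply yourself:

```python
import json


def extract_first_json(text):
    """Return the first balanced top-level JSON object found in `text`,
    or None. Useful when the model wraps its JSON in extra prose.
    Naive brace matching: does not handle braces inside string values."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # not valid JSON; scan from the next brace
        start = text.find("{", start + 1)
    return None
```

This keeps the request itself in the fast non-JSON path while still yielding a parsed object when the model cooperates.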


@RazyRo commented on GitHub (Jun 20, 2024):

?


@ZEUSREY commented on GitHub (Jan 6, 2025):

Llama 3.2 latest works well; Llama Vision only with CPUs.

Reference: github-starred/ollama#28141