Originally created by @ALLMI78 on GitHub (Nov 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7526
Originally assigned to: @dhiltgen on GitHub.
What is the issue?
Sorry, I'm new to GitHub, but I've got this problem and no solution...
When processing requests with any model in Ollama, a 500 Internal Server Error consistently occurs whenever the LLM computation exceeds exactly 2 minutes. This happens regardless of the model size or GPU/CPU usage, indicating a strict runtime limit. Notably, if the model completes processing under 2 minutes, the response returns without error.
Observed Behavior:
The API returns a 500 error precisely at the 2-minute mark, interrupting the LLM’s processing. Debug logs show no specific timeout warnings or errors related to resource limits. No documented configuration settings appear available to adjust this runtime limit.
Expected Behavior:
Ability to configure or bypass the 2-minute processing timeout to allow longer LLM computations, or receive more detailed error feedback regarding timeout settings.
Debug Attempts:
Verified high debug level (OLLAMA_DEBUG=true).
Tested with models of various sizes (confirming 70% VRAM usage or less).
Checked for relevant timeout settings in logs and source files without success.
Searched for relevant timeout settings in Ollama’s documentation and codebase but found no configurable options related to runtime limits.
Environment:
System: Windows 10 64-bit / 4060 Ti 16 GB / 32 GB RAM
Ollama version: 0.3.14
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=4 tid="8572" timestamp=1730876370
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=5 tid="8572" timestamp=1730876370
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=8431 slot_id=0 task_id=5 tid="8572" timestamp=1730876370
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=5 tid="8572" timestamp=1730876370
time=2024-11-06T08:01:27.258+01:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-11-06T08:01:27.259+01:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=M:\OLLAMA\models\blobs\sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d duration=5s
time=2024-11-06T08:01:27.259+01:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=M:\OLLAMA\models\blobs\sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d refCount=0
[GIN] 2024/11/06 - 08:01:27 | 500 | 2m0s | 127.0.0.1 | POST "/api/chat"
time=2024-11-06T08:01:27.285+01:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=M:\OLLAMA\models\blobs\sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d
DEBUG [process_single_task] slot data | n_idle_slots=0 n_processing_slots=1 task_id=2921 tid="8572" timestamp=1730876487
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=56096 status=200 tid="1200" timestamp=1730876487
DEBUG [update_slots] slot released | n_cache_tokens=11346 n_ctx=32768 n_past=11345 n_system_tokens=0 slot_id=0 task_id=5 tid="8572" timestamp=1730876487 truncated=false
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2924 tid="8572" timestamp=1730876487
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=56099 status=200 tid="9824" timestamp=1730876487
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2925 tid="8572" timestamp=1730876487
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=56099 status=200 tid="9824" timestamp=1730876487
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2926 tid="8572" timestamp=1730876487
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=56100 status=200 tid="8676" timestamp=1730876487

The first time the error occurs, the Ollama API does not return anything, only logging the 500 error message. After this, the model continues running, but if it doesn't return within another 2 minutes, the 500 Internal Server Error is triggered again, and the request finally returns with an error.
OS
Windows
GPU
Nvidia
CPU
Intel
Ollama version
0.3.14
@rick-github commented on GitHub (Nov 6, 2024):
Your client has a two minute timeout.
@ALLMI78 commented on GitHub (Nov 6, 2024):
Thank you for your response. However, I believe there’s a misunderstanding regarding the timeout. My client is set to allow up to 180 seconds (3 minutes) for the request, as seen in the code:
And changing this does not help...
So, the issue does not originate from the client-side timeout. It seems the problem is on the server side, where a hard limit of 2 minutes runtime may be causing the error 500 to trigger if the model takes longer than that to compute a response. Can you confirm whether the server has a 2-minute timeout or any other limit that would cause this issue?
https://pypi.org/project/llama-index-llms-ollama/ has a request_timeout for that, I think...
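For anyone hitting the same wall: a quick way to rule out a client-side limit is to call the API with a deliberately generous timeout. A minimal sketch in Python, assuming a default local Ollama at http://127.0.0.1:11434, the requests package, and an example model name:

import requests

# A deliberately generous read timeout (30 minutes) so any 2-minute
# cutoff cannot come from this client; (connect, read) in seconds.
resp = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "llama3.1",  # example model name
        "messages": [{"role": "user", "content": "a long-running prompt"}],
        "stream": False,
    },
    timeout=(5, 1800),
)
print(resp.status_code, resp.elapsed)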
@rick-github commented on GitHub (Nov 6, 2024):
Same for chat endpoint. AFAIK there's nothing on the server side that has a two minute timeout.
@ALLMI78 commented on GitHub (Nov 6, 2024):
I see, many thanks, but you did that on CPU only. Can you please check this on GPU? Is there a GPU runtime limit?
Is it possible that stream true/false influences something here?
I use num_gpu=100, stream=false...
@ALLMI78 commented on GitHub (Nov 6, 2024):
There are two things that puzzle me:
1. When the Ollama server logs the 500 error for the first time, my web request does not return yet. The server somehow restarts the model, which is noticeable when I set keep_alive to 0. The model quickly unloads and reloads, while the web request still does not return. In the first run, before the first 500 error, the GPU runs quietly and normally. After the first 500 error and the internal reload, the GPU becomes significantly louder in the second run. If the response then takes longer than 2 minutes again, the Ollama server cancels, and my web request returns with an error.
2. I am surprised that I can't find anything about this. It seems like no one else has these concerns, but this exact termination after 2 minutes confuses me.
Most of the time, it works as expected: I get a response text with status 200, and if the 500 error appears first, the model runs again, generates its response, and returns with a status 200. However, when this doesn't work and the 2-minute limit is reached twice, after the second 500 error, the web request returns with an error.
@rick-github commented on GitHub (Nov 6, 2024):
If I do it on GPU, it finishes in less than 30 seconds, so it doesn't exceed two minutes. Setting "stream":false still takes over two minutes.
What hardware do you have that can load 33 layers into GPU but still takes more than 2 minutes for a completion?
What language is your client written in? The examples of WebRequest I find on the internet have the timeout in milliseconds, so your 180 is not 3 minutes, it's 0.18 seconds.
Do you run a proxy? If so, have you set noproxy=127.0.0.1 in the client environment?
Server logs may shed some light.
@ALLMI78 commented on GitHub (Nov 6, 2024):
So, does the process on your end also run for more than 2 minutes on the GPU?
What exactly do you want to know? i7-3770, 32 GB RAM, 4060 Ti 16 GB VRAM, Windows 10 64-bit.
The client is written in MQL5, and you are right that the timeout for the MQL5 WebRequest is actually in milliseconds, but this is only a connection timeout ("timeout [in] Sets the maximum time to wait for a response from the server, in milliseconds"), and if that were the cause, I wouldn't even reach the 2-minute runtime.
No proxy.
Server logs, happy to share, I’ve already included them above. They are quite long since I’m using 32k context, but if you tell me what you want to see, I can show you more. However, there's nothing specific in them...
@ALLMI78 commented on GitHub (Nov 6, 2024):
First error 500: it does not stop, and the next try is 200.
I have only removed my prompt text.
Can you see something there?
@rick-github commented on GitHub (Nov 6, 2024):
My test only runs on the GPU for 20-30 seconds, because that's as long as the completion takes. If you could share a completion that your system is trying that takes more than 2 minutes, I could try it here to compare.
It's better if you just attach the full log as a file. There's configuration information at the start and calculations about model loading that are not included in the snippets above, and they may give some insight into what's going on. The model unloading/loading around the 500 is unusual, and also not shown in the snippets.
@dhiltgen commented on GitHub (Nov 6, 2024):
Everywhere we return a 500 from the server, we include some error message about why. Can you make sure your client logs the body of a 500 response, and hopefully that will help explain why the server is running into a problem.
@ALLMI78 commented on GitHub (Nov 6, 2024):
error-500-1.txt
I've prepared a file from the start of the log through one complete cycle including the 500 error, with only the prompt content removed. In this case, you can clearly see how the 500 error is triggered, but everything continues running as normal, and shortly after, the call returns with a 200 status. The strange part is that directly afterward, a second 200 status appears, and I don't know what it's doing.
I haven’t removed anything around the 500 error from the log file; there really isn’t any additional information or details about the error. Unfortunately, I can’t share the prompts used, as they contain sensitive information, and removing that information causes the prompt to no longer work.
In another cycle after a 500 error, two 200 statuses also appear one right after the other—check out the timestamps:
[GIN] 2024/11/06 - 05:45:12 | 500 | 2m0s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/11/06 - 05:46:51 | 200 | 1m38s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/11/06 - 05:46:51 | 200 | 1.0942ms | 127.0.0.1 | POST "/api/chat"
Version 0.4.0 of Ollama was just installed on my system. I’ll test with this one; maybe this issue is already resolved in this version, which is also supposed to improve support for the 4060 GPU.
I’m still considering whether this could be caused by my MQL5-WebRequest, but I’m leaning toward ‘no.’ If the WebRequest were triggering the 500 error, why does it remain connected and shortly afterward return with a 200 status? That doesn’t really make sense.
"Can you make sure your client logs the body of a 500 response, and hopefully that will help explain why the server is running into a problem."
Sure, I can do that, but as far as I remember, with error 500 the body was empty. I'll try that again if the problem appears in 0.4.0 ;) Thanks for your help, guys, you're doing very well ;)
@dhiltgen commented on GitHub (Nov 6, 2024):
I'll mark it as needing more info for now - let us know if 0.4.0 does in fact resolve the problem, or if you see it happen again, and hopefully you can capture the body the client receives.
@ALLMI78 commented on GitHub (Nov 6, 2024):
I’ve now tested with version 0.4.0, and the issue still persists.
I’ve isolated a part of the attached log here...
Please take a close look at the lines marked with the prompt (search for ###REMOVED), and you'll see that:
1. My prompt is sent, then an error 500 occurs after supposedly 2m0s (though I'm not even sure it really ran for 2 minutes).
2. The same prompt is then sent again, and this time it returns a normal 200 response with a time of 55.3543941s, for the same prompt. But that's not me; Ollama does that on its own.
3. SOLVED: Afterward, there's another 200 status with a duration of 1.0847ms, again extremely quick or almost instant. Why two? EDIT: I know now, I do a model unload after it has returned. Sorry, my bad, please ignore the second fast status 200 response, I forgot about that ;)
Back to the 500 error... Unfortunately, I can’t log the response body after 500 error because everything continues as normal after the 500 error. As I mentioned, my WebRequest doesn’t return at the 500 error but only with the 200, and at that point, a regular response (answer) is included.
I noticed this for two reasons: First, the models sometimes run longer than usual (clearly because the first run fails with a 500 error), and after the 500 error, the GPU gets noticeably loud. Normally, it’s barely audible, even when running at 100%, but after the 500 error, something changes. The card still runs at 100%, but it gets significantly warmer, making the fan much more audible.
My assumption is that something in Ollama fails or some timeout expires; the Ollama server changes something internally (maybe to ensure a smoother next run?), runs again, and then successfully returns with a 200 status. Flash Attention possibly? Do you disable something after a bad run?
And why is there no error? As you told me, there should be some info after status 500, but there is nothing ;(
EDIT! What's that: time=2024-11-06T23:39:57.330+01:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" <<< directly after the 500 error, "evaluating already loaded"?
error-500-2.txt
@dhiltgen commented on GitHub (Nov 6, 2024):
There are various code paths that can result in a 500 error being returned to the client application, and in some the error is not logged on the server, but only returned to the client. We can work on improving that in a future version, but if you can find a way to log in your client on non-200 responses that may help us narrow down exactly where the problem is occurring.
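As a sketch of what that client-side logging might look like, in Python under the same assumptions as above (local default endpoint, example model name); per the comment above, the server includes an error message with a 500, typically as a JSON body with an "error" field:

import requests

payload = {
    "model": "llama3.1",  # example model name
    "messages": [{"role": "user", "content": "..."}],
    "stream": False,
}
resp = requests.post("http://127.0.0.1:11434/api/chat", json=payload, timeout=(5, 1800))
if resp.status_code != 200:
    # Log the raw body: a 500 usually carries a JSON object
    # like {"error": "..."} explaining the failure.
    print("HTTP", resp.status_code, "body:", resp.text)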
@ALLMI78 commented on GitHub (Nov 7, 2024):
I'm currently trying to do that, because sometimes it comes back with an error after 2 or more 500 errors... The system is running, but it can work for hours; last time it ran fine for 4-5 hours - until it failed...
Any idea on that -> time=2024-11-06T23:39:57.330+01:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded"
Thanks for your support mate ;)
@rick-github commented on GitHub (Nov 7, 2024):
This looks like a timeout/retry.
There's 1m56.185s of processing in the first attempt, but the request terminates two minutes after the client sent the request. The client retries and gets a cache hit from the previous processing, letting the completion finish more quickly and return a result.
I'm not familiar with the cache code so I don't know if my interpretation is correct, but it's a cache - it's supposed to speed things up, right?
@rick-github commented on GitHub (Nov 7, 2024):
If you can run wireshark on your machine you could capture the packets at the time of the 500 and see any server response and which side closes the connection.
@ALLMI78 commented on GitHub (Nov 7, 2024):
I got something...
Some MQL5, to explain the result:

int timeout = 1000;
ResetLastError();
int awc = WebRequest("POST", endpoint, header, timeout, postDataUTF8, result, responseHeader);
int gle = GetLastError();
responseHeader = CleanData(responseHeader);
if(gle != 0) {
   // awc holds the HTTP status code returned by WebRequest; gle holds the MQL5 last-error code
   Print(__LINE__);
   Print(__FUNCTION__ + "() | GLE-ERROR | WebRequest() | GLE " + ErrorDescription(gle));
   Print(__FUNCTION__ + "() | GLE-ERROR | WebRequest() | AWC " + HttpErrorDescription(awc));
}
After two 500 errors:
5203 -> MQL5 -> ERR_WEBREQUEST_REQUEST_FAILED
1003 -> WebRequest return from the Ollama API
The returned result and responseHeader are both empty ;(
I got four 500 errors in my Ollama log. The first three did nothing, or at least didn't stop anything, but number four did something different:
The last two [GIN] entries are 500 errors. All earlier 500s did not stop the request; they had a 200 after the restart or reload. But this time, after two 500 errors, the WebRequest returned...
The pattern is:
[500] + [200] -> no WebRequest return until the 200 comes back
[500] -> no WebRequest return + another [500] -> WebRequest returns with the 5203 + 1003
1003 -> WebRequest return from the Ollama API -> what does 1003 from Ollama mean?
I hope it helps; that's all I can see at the moment...
@rick-github commented on GitHub (Nov 7, 2024):
I'm guessing that the requests have keep_alive=5. It's like the client has a maximum number of retries, and when it exceeded that this time, it finally returned an error up the call stack.
@ALLMI78 commented on GitHub (Nov 7, 2024):
Yes, keep_alive is set to 5 because previously I had keep_alive at 0, and when a 500 error occurred, the model was immediately unloaded and then reloaded to continue. With keep_alive set to 5, the model remains loaded, but unfortunately this still didn't solve my issue: after two 500 errors, the WebRequest returns empty.
I’ve been struggling with this issue for several weeks and haven’t found any leads on how to solve it so far.
1003 -> WebRequest return from ollama api -> what does 1003 from ollama mean?
@rick-github commented on GitHub (Nov 7, 2024):
ollama doesn't return 1003, it's an MQL5 internal error code.
It still looks like a client timeout. Note that the client is not necessarily your program; it may be something in between, like a load balancer, but you don't have a proxy, so that's not likely.
@ALLMI78 commented on GitHub (Nov 7, 2024):
Sorry mate, but there is nothing in between; it's my local PC, just MT5 (MetaTrader 5) and the Ollama Windows API...
"ollama doesn't return 1003, it's an MQL5 internal error code."
Are you sure about that? The 5203 from GetLastError() is the MQL5 code...
Because Ollama also returns the 200 that I want to see? I always check that with if(awc != 200) and so on...
awc in my example is the HTTP code returned from the server called in the WebRequest, so in this case, the Ollama server.
After two 500 errors:
5203 -> MQL5 -> ERR_WEBREQUEST_REQUEST_FAILED
1003 -> HTTP status code returned by WebRequest from the Ollama server or API
@rick-github commented on GitHub (Nov 7, 2024):
200, 500, 404, etc are well defined HTTP return codes. 1003 is not on the list. My guess is that MQL5 developers have decided to overload the return value of WebRequest, using numbers < 1000 as normal HTTP return codes, and numbers > 1000 as MQL5 specific return codes.
@ALLMI78 commented on GitHub (Nov 7, 2024):
That may be possible, but we have no proof of that...?
https://www.mql5.com/en/docs/network/webrequest -> Return Value: HTTP server response code or -1 for an error.
My ChatGPT may have found something:
It’s exactly what I am doing—I'm loading two models in a rotation, one is exactly llama3.1, and the other is qwen2.5.
ChatGPT found it here:
I can't find anything with 1003 there, but the model switching and error 500 are the same...
Does unloading a model really clean everything?
I run llama -> unload -> run qwen -> unload -> run llama, and so on...
@rick-github commented on GitHub (Nov 7, 2024):
1003 is not a valid HTTP server response. 1003 is not even in the source code.
ChatGPT is taking a kernel of truth and hallucinating an entire field of wheat. Your logs don't indicate any decode errors, the issue was fixed some time ago, and it doesn't explain the two minute timeout or the client retries.
@ALLMI78 commented on GitHub (Nov 7, 2024):
Thanks for the explanation... I'm aware that 1003 is not a standard HTTP status code.
I had a similar thought but in reverse: I suspected that maybe Ollama or llama.cpp could be returning a custom status code 1003. I hadn’t mentioned to ChatGPT that I was switching between different models, so the insight it found really seems relevant. I had specifically asked ChatGPT to search for connections between 1003 and Ollama or llama.cpp, which led to these findings.
If 1003 isn’t actually a specific return code from Ollama or llama.cpp, I agree with your theory that the problem might lie with the MQL5 WebRequest. This is supported by the fact that no one else seems to have encountered this 'Error 500 after 2 minutes' issue.
So, your assumption would be that there’s a problem with the MQL5 WebRequest where Ollama outputs a 500 error, but then the MQL5 WebRequest just converts it into a 1003 instead?
@rick-github commented on GitHub (Nov 7, 2024):
My assumption is that WebRequest has some hidden internals that cause it to timeout after two minutes of no response from a server, retry that 2 more times, and if there's still no response, return a 1003. If that's the case and there's no feature to override that behaviour, then the way around it would be to enable streaming in the ollama request and aggregate the results in a loop around the WebRequest call.
@rick-github commented on GitHub (Nov 7, 2024):
Something like
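In Python, the streaming-and-aggregate loop described above might look like the following sketch (the thread's client is MQL5, but the logic is the same; this assumes /api/chat streams one JSON object per line, each carrying a message.content fragment and a final done flag, with the same assumed local endpoint and example model as earlier):

import json
import requests

with requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "llama3.1",  # example model name
        "messages": [{"role": "user", "content": "a long-running prompt"}],
        "stream": True,
    },
    stream=True,
    timeout=(5, 1800),
) as resp:
    parts = []
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Each streamed chunk carries a fragment of the reply; collect them all.
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
print("".join(parts))

Because tokens arrive as they are generated, the connection is never silent for two minutes at a stretch, which is the behaviour the WebRequest timeout appears to key on.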
@ALLMI78 commented on GitHub (Nov 7, 2024):
That sounds very plausible. I was just about to ask you how your theory fits with the fact that my request is actually sent twice.
But you already explained that, hmmm...
Can you please explain what the "evaluating already loaded" after the 500 error means?
Alright, with stream=true, this issue shouldn’t occur then - I’ll test it out! ;)
@rick-github commented on GitHub (Nov 7, 2024):
Whenever a request comes in, ollama checks to see if there's a currently loaded model that can be used to process the request. "evaluating already loaded" means that just after the 500, the client re-sent the query, so ollama checked that the current model would be able to be used.
@ALLMI78 commented on GitHub (Nov 7, 2024):
Thanks mate, and does unloading a model (with messages:[] and keep_alive:0) really clean everything, all state, caches, and so on?
So switching between models is no problem? I wasn't sure if this was the cause...
@rick-github commented on GitHub (Nov 7, 2024):
It should clean everything. If it doesn't, that would be a bug. Model switching should be no problem.
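For reference, the unload being discussed is just a chat call with an empty message list and keep_alive set to 0; a minimal sketch, under the same assumed endpoint and example model name:

import requests

# Asks the server to unload the model immediately instead of
# keeping it resident for the default keep-alive period.
requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={"model": "llama3.1", "messages": [], "keep_alive": 0},
)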
@ALLMI78 commented on GitHub (Nov 7, 2024):
04:43:20 | 200 | 2m6s | 127.0.0.1 | POST "/api/chat"
but then: [GIN] 2024/11/07 - 04:53:04 | 200 | 2m0s | 127.0.0.1 | POST "/api/chat"
and
2024.11.07 04:53:04.041 SendAPIRequest() | GLE-ERROR | WebRequest() | GLE 5203 - WebRequest - request failed
2024.11.07 04:53:04.041 SendAPIRequest() | GLE-ERROR | WebRequest() | HTTP-Error [1003]
2024.11.07 04:53:04.041 SendAPIRequest() | GLE-ERROR | WebRequest() | header -> StringLen = 0 | body -> ArraySize = 0
stream = true does not help, but as you can see, there was a 200 in Ollama, and the WebRequest returns on that with 5203 and 1003, an empty header and an empty body. So yes, that does seem to be a problem with WebRequest ;(...
@rick-github commented on GitHub (Nov 7, 2024):
Perhaps WebRequest wants the full response to arrive and the server to close the socket, signalling it's finished, before it returns. Maybe you have to spin your own WebRequest with SocketCreate/SocketConnect.
@rick-github commented on GitHub (Nov 7, 2024):
Do you know if ollama has actually output any tokens when it times out? Maybe your completion is just so complex that it takes more than two minutes before it emits something that WebRequest can read.
You could try using curl with one of your prompts and see how long it takes for ollama to respond.
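Timing the first streamed chunk makes the same check without curl; a sketch under the same assumptions as the earlier snippets:

import time
import requests

start = time.monotonic()
with requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "llama3.1",  # example model name
        "messages": [{"role": "user", "content": "one of the long prompts"}],
        "stream": True,
    },
    stream=True,
    timeout=(5, 1800),
) as resp:
    for line in resp.iter_lines():
        if line:
            # If this exceeds ~120s, a client like WebRequest would have seen
            # nothing at all before its apparent two-minute cutoff.
            print(f"first chunk after {time.monotonic() - start:.1f}s")
            break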
@ALLMI78 commented on GitHub (Nov 7, 2024):
Hello, I’m going to take a close look at this now, but it might take some time. I need to check what options are available for accessing the Ollama API from MQL5, maybe using sockets or ShellExecute / curl.
There are indications that other MQL5 users have also had issues with 1003 errors:
The message 'read failed, 0 error, 1 byte needed' is interesting here, but it seems like 1003 is a general error that can appear in very different situations. What surprises me, though, is that MetaTrader 5 and MQL5 have been around for a while, and if this is indeed a bug, it hasn’t been fixed yet.
This was my thought as well: that my request might be too complex or unsolvable. But I don't understand why, on retry after a 500 error, I often get a response on a new run. In most cases, I first get a 500 error and then a 200 response within two minutes. Only in rare cases do I get two 500 errors, at which point I cancel on the client side.
I’ll update you once I know more. ;)
@rick-github commented on GitHub (Nov 7, 2024):
Cache.
@ALLMI78 commented on GitHub (Nov 7, 2024):
OK, I did some tests with my large prompts in Windows PowerShell, and:
[GIN] 2024/11/07 - 19:44:40 | 200 | 3m56s | 127.0.0.1 | POST "/api/chat"
No 500 error...
Sorry for the confusing bug hunt, but WebRequest was working fine before this problem...
Many thanks again for the fast support ;)
EDIT: I'm now using a self-made WebRequest replacement, using ShellExecuteW in MQL5 to run a PowerShell Invoke-WebRequest, and I can confirm that the runtime limit was on the client side. It works very well now...
[GIN] 2024/11/11 - 00:30:02 | 200 | 6m9s | 127.0.0.1 | POST "/api/chat"
Zero 500 errors, thanks again ;)