[GH-ISSUE #9126] llama server loading timed out in windows server 2022 vm #5936

Closed
opened 2026-04-12 17:16:33 -05:00 by GiteaMirror · 10 comments
Owner

Originally created by @Asif-droid on GitHub (Feb 15, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9126

What is the issue?

I've been facing this issue for some time and it's not consistent: sometimes the Ollama container gives responses, sometimes it shows this error, and sometimes it just hangs.
My server VM configuration:
Windows Server 2022 (based on Windows 10, 21H2)

Docker compose for ollama:

```yaml
services:
  ollama:
    image: ollama/ollama:0.3.6
    volumes:
      - ./llm_cache/:/root/.ollama/
    restart: unless-stopped
    ports:
      - 8555:11434
```

Request to the API:

```python
import json

import requests

ollama_payload = {
    "model": "llama3.1:latest",
    "system": json.dumps(prompt),
    "prompt": json.dumps(text),
    "format": "json",
    "stream": False,
    "options": {
        "temperature": 0.3,
        # "top_k": 20,
        "num_ctx": 8192,
    },
}
ollama_response = requests.post(
    url="http://localhost:8555/api/generate",
    headers={"Content-Type": "application/json"},
    data=json.dumps(ollama_payload),
)
```
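For reference, with `"stream": False` the `/api/generate` endpoint returns a single JSON object whose generated text is in the `response` field; a minimal sketch of checking the reply, assuming the `ollama_response` variable above:

```python
# Sketch: inspecting the non-streaming /api/generate reply from the request above.
ollama_response.raise_for_status()        # surface HTTP errors (e.g. a 5xx on load timeout)
body = ollama_response.json()
generated = body.get("response", "")      # generated text; an empty string here shows up as a "blank response"
print(body.get("done"), len(generated))   # "done" is True once generation finished
```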

errors in output logs:

![Image](https://github.com/user-attachments/assets/7da690fe-e61a-4889-8fb6-84d2c968ca98)
![Image](https://github.com/user-attachments/assets/fbee2730-10c5-4cdc-a111-190e575f31f9)

Relevant log output


OS

Windows, Docker

GPU

no gpu

CPU

Intel Xeon Gold 6338

Ollama version

0.3.6

GiteaMirror added the bug label 2026-04-12 17:16:33 -05:00
Author
Owner

@LeisureLinux commented on GitHub (Feb 15, 2025):

That Ollama version is too old; I suggest you upgrade to the latest first.

Author
Owner

@rick-github commented on GitHub (Feb 15, 2025):

Set [`OLLAMA_LOAD_TIMEOUT=30m`](https://github.com/ollama/ollama/blob/d006e1e09be4d3da3fb94ab683aa18822af4b956/envconfig/config.go#L245) in the server environment.

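In the Docker setup above, the server environment is the container's environment; a minimal Compose sketch, assuming the original service definition:

```yaml
# Sketch: passing the load timeout to the Ollama server inside the container.
services:
  ollama:
    image: ollama/ollama:0.3.6
    environment:
      - OLLAMA_LOAD_TIMEOUT=30m   # allow up to 30 minutes for model loading on slow, CPU-only hosts
    volumes:
      - ./llm_cache/:/root/.ollama/
    restart: unless-stopped
    ports:
      - 8555:11434
```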
Author
Owner

@Asif-droid commented on GitHub (Feb 16, 2025):

Recently, I've been getting blank responses for large prompts. What could be the reason?

Author
Owner

@rick-github commented on GitHub (Feb 16, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

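Since Ollama runs in a container here, the server log is the container's output; a minimal sketch of pulling it, assuming the Compose service is named `ollama` as in the file above:

```shell
# Follow the server log of the Compose service defined above
docker compose logs -f ollama

# Or by container name/ID, outside of Compose
docker logs -f <container>
```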
Author
Owner

@elarbor commented on GitHub (Feb 18, 2025):

Is there a Chinese version of the [API documentation](https://github.com/ollama/ollama/blob/main/docs/api.md)?

Author
Owner

@Asif-droid commented on GitHub (Feb 18, 2025):

Currently running on a Windows Server 2022 VM with 4 CPUs and 12 GB RAM and getting this error. The VM has no access to a GPU. model = llama3.1:latest

![Image](https://github.com/user-attachments/assets/1edd07c3-9525-46a2-8e90-6d872b3f47dc)

Author
Owner

@rick-github commented on GitHub (Feb 18, 2025):

[Currently GPU support in Docker Desktop is only available on Windows with the WSL2 backend.](https://docs.docker.com/desktop/features/gpu/)
Author
Owner

@Asif-droid commented on GitHub (Feb 18, 2025):

I'm trying to run it in CPU-only mode. I know it'll be very slow, but I have no choice.

Author
Owner

@rick-github commented on GitHub (Feb 18, 2025):

So then a missing driver is expected.

Author
Owner

@Asif-droid commented on GitHub (Feb 24, 2025):

I have solved the issue: the VM could not access the model, and pulling the model again inside the container fixed it for me.
I also enabled flash attention in the environment for faster responses.
These are my environment variables (see the Compose sketch after this comment):

- OLLAMA_DEBUG=1
- OLLAMA_FLASH_ATTENTION=1
- OLLAMA_NUM_PARALLEL=1 # Adjust based on your CPU power
- OLLAMA_MAX_LOADED=1 # Limit the number of models loaded into memory
- OLLAMA_KEEP_ALIVE=24h

And in the request:
num_ctx=8192
This is important because in my previous requests the prompts were being truncated: I had mistakenly put 8196 instead of 8192, so the default 2048 was being used.

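A minimal Compose sketch of that working setup, assuming the original service definition; the variable names and values are copied verbatim from the comment above:

```yaml
# Sketch: the reporter's working environment, applied to the original Compose service.
services:
  ollama:
    image: ollama/ollama:0.3.6
    environment:
      - OLLAMA_DEBUG=1
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_NUM_PARALLEL=1    # adjust based on CPU power
      - OLLAMA_MAX_LOADED=1      # limit the number of models loaded into memory
      - OLLAMA_KEEP_ALIVE=24h
    volumes:
      - ./llm_cache/:/root/.ollama/
    restart: unless-stopped
    ports:
      - 8555:11434
```

Re-pulling the model inside the running container can then be done with something like `docker compose exec ollama ollama pull llama3.1:latest` (service name assumed from the Compose file above).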

Reference: github-starred/ollama#5936