[GH-ISSUE #8214] /api/chat and /api/generate endpoints are timing out #67299

Closed
opened 2026-05-04 09:53:59 -05:00 by GiteaMirror · 13 comments

Originally created by @wkevin on GitHub (Dec 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8214

What is the issue?

Currently, the `/api/tags` and `/api/ps` endpoints are functioning properly, but the `/api/chat` and `/api/generate` endpoints are experiencing timeouts.
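
For anyone trying to reproduce, the endpoints can be exercised directly with curl; the model name and timeout values below are placeholders:

```bash
# Probe the working and the timing-out endpoints directly.
curl -s --max-time 10  http://localhost:11434/api/tags
curl -s --max-time 10  http://localhost:11434/api/ps
curl -s --max-time 120 http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:14b", "prompt": "hello", "stream": false}'
curl -s --max-time 120 http://localhost:11434/api/chat \
  -d '{"model": "qwen2.5-coder:14b", "messages": [{"role": "user", "content": "hello"}], "stream": false}'
```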

Docker Image Version: 0.5.4

Upon checking the server information, I noticed that the VRAM usage and processes reported by `nvidia-smi` do not match those shown by the `ollama ps` command.

![image](https://github.com/user-attachments/assets/17ac4c3f-097c-410c-9c62-b388c0ede2c1)
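
A quick way to compare the two views (the `nvidia-smi` query flags assume a reasonably recent driver; the container name `ollama` is an assumption):

```bash
# What the driver sees on the GPU:
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# What Ollama believes is loaded (run inside the container if applicable):
docker exec ollama ollama ps
```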

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.4

GiteaMirror added the bug label 2026-05-04 09:53:59 -05:00

@rick-github commented on GitHub (Dec 23, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

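For the Docker setup described in the issue, something along these lines should capture them (the container name `ollama` is an assumption; `OLLAMA_DEBUG=1` makes the next run more verbose):

```bash
# Dump the current container logs to a file.
docker logs ollama > ollama-server.log 2>&1
# Optionally restart with debug logging enabled for the next reproduction.
docker rm -f ollama
docker run -d --gpus=all -e OLLAMA_DEBUG=1 -v ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama:0.5.4
```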

@wkevin commented on GitHub (Dec 23, 2024):

> [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

I have rolled back to the previous Docker image, so the docker logs are gone.
I am now switching back to 0.5.4 to reproduce this issue. Please wait. Thanks.


@a1ananth commented on GitHub (Dec 23, 2024):

I am having the same issue with v0.5.4. It works fine if I downgrade, but auto-update brings it back to 0.5.4 and it starts crashing again.

My logs are:

time=2024-12-23T13:31:57.296+05:30 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=E:\AI\Ollama\models\blobs\sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463 gpu=GPU-345e1da7-095d-c43e-50f9-cb9f21b7d201 parallel=4 available=11379396608 required="5.6 GiB"
time=2024-12-23T13:31:57.314+05:30 level=INFO source=server.go:104 msg="system memory" total="63.9 GiB" free="41.7 GiB" free_swap="49.2 GiB"
time=2024-12-23T13:31:57.314+05:30 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.6 GiB" memory.required.partial="5.6 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[5.6 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
time=2024-12-23T13:31:57.315+05:30 level=INFO source=server.go:160 msg="user override" OLLAMA_LLM_LIBRARY=cuda_v12 path=D:\Programs\AI\ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe
time=2024-12-23T13:31:57.317+05:30 level=INFO source=server.go:376 msg="starting llama server" cmd="D:\\Programs\\AI\\ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe runner --model E:\\AI\\Ollama\\models\\blobs\\sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463 --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 8 --no-mmap --parallel 4 --port 59599"
time=2024-12-23T13:31:57.320+05:30 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-23T13:31:57.320+05:30 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2024-12-23T13:31:57.320+05:30 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2024-12-23T13:31:57.389+05:30 level=INFO source=runner.go:941 msg="starting go runner"
time=2024-12-23T13:31:57.389+05:30 level=INFO source=runner.go:942 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(clang)" threads=8
time=2024-12-23T13:31:57.390+05:30 level=INFO source=.:0 msg="Server listening on 127.0.0.1:8080"
gguf_init_from_file: failed to open '': 'Invalid argument'
llama_model_load: error loading model: llama_model_loader: failed to load model from 

llama_load_model_from_file: failed to load model
panic: unable to load model: 

goroutine 6 [running]:
main.(*Server).loadModel(0xc0000a8120, {0x0, 0x0, 0x1, 0x0, {0x0, 0x0, 0x0}, 0xc0000241c0, 0x0}, ...)
	github.com/ollama/ollama/llama/runner/runner.go:863 +0x3ad
created by main.main in goroutine 1
	github.com/ollama/ollama/llama/runner/runner.go:975 +0xc6c
time=2024-12-23T13:31:57.571+05:30 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error loading model: llama_model_loader: failed to load model from"
[GIN] 2024/12/23 - 13:31:57 | 500 |     335.213ms |  192.168.29.160 | POST     "/v1/chat/completions"

This repeats 3 times on each completion request, possibly due to retry logic.
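
A rough way to confirm that retry count from a saved server log (the error string is copied from the output above; the log file name is a placeholder):

```bash
# Each failed load attempt logs this scheduler error once.
grep -c 'error loading llama server' server.log
```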


@rick-github commented on GitHub (Dec 23, 2024):

It's not the same issue; your runners are crashing rather than running slowly. Create a new issue and include full server logs.


@jessegross commented on GitHub (Dec 23, 2024):

@wkevin What was the previous version that you rolled back to that didn't have the issue?


@wkevin commented on GitHub (Dec 25, 2024):

> @wkevin What was the previous version that you rolled back to that didn't have the issue?

0.3.10

I have run 0.5.4 again for two days, and the timeout hasn't reoccurred.


@rick-github commented on GitHub (Dec 26, 2024):

I speculate that the problem @wkevin had was a GPU disconnect (https://github.com/ollama/ollama/issues/6928) or some other Nvidia error. The runners stay loaded, but because they can no longer use the GPU, they fall back to doing inference on the CPU. If the client has a deadline that assumes a GPU is doing inference, then it may be too short for the case where the CPU is being used. What deadlines have you set on the API calls?

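One way to separate a slow (possibly CPU-bound) response from a genuine hang is to repeat the request with a deliberately generous client-side deadline, e.g.:

```bash
# If this eventually returns, inference is just slow (e.g. CPU fallback)
# rather than hung; the model name and 30-minute cap are placeholders.
time curl -s --max-time 1800 http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:14b", "prompt": "hello", "stream": false}'
```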

@wkevin commented on GitHub (Dec 26, 2024):

> I speculate that the problem @wkevin had was a GPU disconnect (#6928) or some other Nvidia error. The runners stay loaded, but because they can no longer use the GPU, they fall back to doing inference on the CPU. If the client has a deadline that assumes a GPU is doing inference, then it may be too short for the case where the CPU is being used. What deadlines have you set on the API calls?

Thanks. But the process shows 100% GPU in the `ollama ps` output.

I use VSCode + Continue, with no special deadline configuration.


@wkevin commented on GitHub (Dec 26, 2024):

The good news is that qwen2.5-coder:14b generates much faster on 0.5.4 than on 0.3.10. Many thanks to the ollama team!


@rick-github commented on GitHub (Dec 26, 2024):

> Thanks. But the process shows 100% GPU in the `ollama ps` output.

Yes, the models were loaded when the GPU was accessible but then it was disconnected. Server logs would show what happened.

> I use VSCode + Continue, with no special deadline configuration.

What deadline does VSCode + Continue set?


@wkevin commented on GitHub (Dec 26, 2024):

> What deadline does VSCode + Continue set?

I don't know how to configure the API call deadline in VSCode + Continue.


@wkevin commented on GitHub (Dec 30, 2024):

I have run 0.5.4 for one week and the timeout hasn't reoccurred.
Maybe this issue should be closed.


@rick-github commented on GitHub (Dec 30, 2024):

If the problem repeats, get the server logs before rolling back and re-open this issue.
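
For the Docker setup, saving the logs before the rollback could look like this (the container name `ollama` is an assumption):

```bash
# Capture the container logs to a dated file before rolling back.
docker logs ollama > "ollama-$(date +%Y%m%d).log" 2>&1
```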

Reference: github-starred/ollama#67299