[GH-ISSUE #8255] question: the Windows version is very slow when accessing the API #31036

Closed
opened 2026-04-22 11:08:57 -05:00 by GiteaMirror · 7 comments

Originally created by @liu9187 on GitHub (Dec 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8255

Excuse me, why is the Windows version very slow when accessing the API? Using the command line is faster.
System: Windows
Memory: 24 GB

GiteaMirror added the performance and needs more info labels 2026-04-22 11:08:57 -05:00

@rick-github commented on GitHub (Dec 27, 2024):

Example?


@liu9187 commented on GitHub (Dec 28, 2024):

> Example?

I started the service locally, and accessing the API locally took 77.2s, using the example from the documentation.


@rick-github commented on GitHub (Dec 28, 2024):

The complete lack of information makes it difficult to diagnose the issue. Please supply a description of how you access the API and what request you send, how you access the command line and what request you send, and [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues).


@liu9187 commented on GitHub (Dec 30, 2024):

So, let me start from the installation of Ollama:

First, I installed Ollama according to the documentation and used `ollama run llama3.2` to start the model. Note that I did not start the service with `ollama serve`, because I found that the service was already running after Ollama was installed.

Then, I used Postman to access the local interface, for example:

```
POST http://localhost:11434/api/generate

{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}
```

The response is very slow.

![Snipaste_2024-12-30_09-52-25](https://github.com/user-attachments/assets/f0248d6f-dbca-40a3-b7fb-e0949bb64ca5)
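
The same request can be reproduced and timed outside Postman with a minimal Python sketch (assuming Ollama is listening on the default `localhost:11434`; setting `"stream": false` returns one JSON object instead of a token stream):

```python
# Minimal sketch: send the documented /api/generate request and time it client-side.
import json
import time
import urllib.request

payload = {
    "model": "llama3.2",
    "prompt": "Why is the sky blue?",
    "stream": False,  # one JSON response instead of a stream of tokens
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

start = time.monotonic()
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
elapsed = time.monotonic() - start

print(f"request took {elapsed:.1f}s")
print(body.get("response", ""))
```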


@liu9187 commented on GitHub (Dec 30, 2024):

[server.log](https://github.com/user-attachments/files/18271545/server.log)
This is my server log file.

Could there be an issue with the configuration?


@rick-github commented on GitHub (Dec 30, 2024):

```
time=2024-12-30T08:58:11.143+08:00 level=INFO source=gpu.go:334 msg="detected OS VRAM overhead" id=GPU-6b0e5297-4c6f-744d-6a8a-d23788c70983 library=cuda compute=6.1 driver=12.2 name="NVIDIA GeForce MX150" overhead="329.4 MiB"
time=2024-12-30T08:58:11.146+08:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-6b0e5297-4c6f-744d-6a8a-d23788c70983 library=cuda variant=v12 compute=6.1 driver=12.2 name="NVIDIA GeForce MX150" total="2.0 GiB" available="1.6 GiB"
```

You have a small GPU, only 2GiB VRAM on it.

```
time=2024-12-30T09:23:07.070+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=8 layers.split="" memory.available="[1.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.2 GiB" memory.required.partial="1.6 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[1.6 GiB]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="256.5 MiB" memory.graph.partial="570.7 MiB"
time=2024-12-30T09:23:07.082+08:00 level=INFO source=server.go:376 msg="starting llama server" cmd="D:\\ollama\\bin\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe runner --model D:\\ollama\\models\\blobs\\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --n-gpu-layers 8 --no-mmap --parallel 1 --port 56461"
[GIN] 2024/12/30 - 09:23:12 | 200 |    6.4089339s |       127.0.0.1 | POST     "/api/generate"
```

When you ran from the command line, there was 1.6GiB available for loading the model. ollama offloaded 8 of 29 layers, so not quite a third of the model was running in VRAM. ollama answered the prompt in 6 seconds.

```
time=2024-12-30T09:35:43.650+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=3 layers.split="" memory.available="[1.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.3 GiB" memory.required.partial="1.3 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[1.3 GiB]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="256.5 MiB" memory.graph.partial="570.7 MiB"
time=2024-12-30T09:35:43.661+08:00 level=INFO source=server.go:376 msg="starting llama server" cmd="D:\\ollama\\bin\\lib\\ollama\\runners\\cuda_v12_avx\\ollama_llama_server.exe runner --model D:\\ollama\\models\\blobs\\sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --n-gpu-layers 3 --no-mmap --parallel 1 --port 49401"
[GIN] 2024/12/30 - 09:39:44 | 200 |          4m1s |       127.0.0.1 | POST     "/api/generate"
```

When you sent the prompt from Postman, there was only 1.3GiB available, and only 3 of 29 layers or about 10% were running in VRAM. ollama took 241 seconds to respond.
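
For a quick side-by-side of the two runs, here is an illustrative back-of-the-envelope sketch using only the numbers from the log excerpts above:

```python
# Rough comparison of the command-line run and the Postman run from the log.
total_layers = 29
cli_layers, cli_seconds = 8, 6.4      # command-line run: 6.4089339s
api_layers, api_seconds = 3, 241.0    # Postman run: 4m1s

print(f"CLI run:     {cli_layers}/{total_layers} layers on GPU "
      f"({cli_layers / total_layers:.0%}), {cli_seconds:.1f}s to respond")
print(f"Postman run: {api_layers}/{total_layers} layers on GPU "
      f"({api_layers / total_layers:.0%}), {api_seconds:.0f}s to respond")
print(f"slowdown:    ~{api_seconds / cli_seconds:.0f}x")
```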

A GPU with enough VRAM to load the entire model is required to get fast inference speed. You can either get a bigger GPU or use a smaller model; for example, [qwen2.5:1.5b](https://ollama.com/library/qwen2.5:1.5b) will fit in the available VRAM.
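
As a sketch of the smaller-model option (assuming the model has first been pulled with `ollama pull qwen2.5:1.5b` and the server is on the default port), the same generate request only needs a different model name:

```python
# Same request as above, but against a model small enough for ~1.3-1.6 GiB of free VRAM.
import json
import urllib.request

payload = {"model": "qwen2.5:1.5b", "prompt": "Why is the sky blue?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp).get("response", ""))
```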


@liu9187 commented on GitHub (Dec 30, 2024):

OK, thank you very much for your answer @rick-github
