[GH-ISSUE #7253] The issue regarding concurrent processing with multiple GPU cards #4609

Closed
opened 2026-04-12 15:31:40 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @goactiongo on GitHub (Oct 18, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7253

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Premise:

The Linux server has 4 GPU cards, and OLLAMA_SCHED_SPREAD=1 is set with the aim of improving the models' inference efficiency through concurrent processing on multiple GPU cards.

My Scenario:

Within the same process, I call 3 different LLM models in sequence on the same task (for example, summarizing a long text), so that users can see the summaries produced by the 3 different models and compare their results.

After the process runs, each model can be observed running on multiple GPU cards, but the following issues arise:

  1. After the first model finishes running, the second model reports an OOM error, and the third model sometimes succeeds and sometimes fails.
  2. Is this because the GPU resources are not fully released after the first model finishes, so the second model fails for lack of GPU memory, while the third model succeeds if the resources have been released by then and fails if they have not?
  3. If OLLAMA_SCHED_SPREAD=1 is not set, all three models run successfully because ollama handles each model's requests on a different GPU card, but this approach is slower because each model then runs on a single GPU card.

My requirements are as follows:

  1. With OLLAMA_SCHED_SPREAD=1 set, how can GPU resources be released quickly after the first model finishes running, so that subsequent models do not fail due to insufficient GPU resources? (A sketch of such an unload request follows this list.)
  2. If requirement 1 cannot be met, what other methods can improve model inference efficiency through concurrent processing on multiple GPU cards?
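
For illustration only, here is a minimal sketch of a request that asks the server to unload the model immediately after it responds, using the keep_alive field of the /api/generate API. The endpoint URL, model name, and prompt are placeholders, and whether this is enough to avoid the OOM behaviour described above is exactly what this issue is asking about.

import requests

# Placeholder endpoint and model; adjust to the actual deployment.
OLLAMA_URL = "http://localhost:11434/api/generate"

resp = requests.post(OLLAMA_URL, json={
    "model": "llama3.1:8b",
    "prompt": "Summarize the following text: ...",
    "stream": False,
    # keep_alive=0 asks the server to unload this model right after the
    # response instead of keeping it resident for the default 5 minutes.
    "keep_alive": 0,
})
resp.raise_for_status()
print(resp.json()["response"])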

Other Questions (without setting OLLAMA_SCHED_SPREAD=1):

Ollama defaults to OLLAMA_NUM_PARALLEL=4. If a single GPU card cannot supply the resources for 4 parallel slots, ollama automatically falls back to PARALLEL=1. At that point, if a single GPU card can supply the resources needed for PARALLEL=1, inference runs on that one card; if a single GPU card cannot, ollama automatically uses all 4 GPU cards. Is this the automatic, default mechanism?
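
To make sure I am describing the behaviour clearly, here is the fallback I mean, paraphrased in code form only; this is not Ollama's actual scheduler logic, and the vram_needed callback is hypothetical.

def pick_layout(free_vram_per_gpu, vram_needed, requested_parallel=4):
    """Paraphrase of the fallback described above; not Ollama's real scheduler.

    free_vram_per_gpu: list of free VRAM per card, in GiB.
    vram_needed(parallel): hypothetical estimate of the VRAM the model needs
    at a given parallel setting, in GiB.
    """
    # Try the default OLLAMA_NUM_PARALLEL=4 on a single card first.
    for i, free in enumerate(free_vram_per_gpu):
        if free >= vram_needed(requested_parallel):
            return f"single GPU {i}", requested_parallel
    # Otherwise fall back to parallel=1 and try a single card again.
    for i, free in enumerate(free_vram_per_gpu):
        if free >= vram_needed(1):
            return f"single GPU {i}", 1
    # If even parallel=1 does not fit on one card, spread across all cards.
    return "spread across all GPUs", 1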

Summary:

The overall requirement is to improve the efficiency of concurrent inference when multiple GPU cards are available, and thereby improve the user experience.

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.14

GiteaMirror added the bug label 2026-04-12 15:31:40 -05:00
Author
Owner

@rick-github commented on GitHub (Oct 18, 2024):

When you get an OOM error, is all VRAM allocated? Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) and the output of nvidia-smi will aid in debugging.

Author
Owner

@goactiongo commented on GitHub (Oct 19, 2024):

Hi @rick-github
I have analyzed the relevant logs and attached the log files to this issue; I also copied them into issue #7146, so please close whichever one is redundant.

Thank you for your reply. I have conducted the following two tests, and I am unsure how to handle some issues, so I need your further assistance.

1. Overall Situation Description

1.1 In the same process, three models are called in sequence through the API to handle the content summarization task. The models are llama3.1:8b, glm4:9b, and llama3.2:latest, and each call sets the following parameters:

"num_ctx": 121000,
"num_predict": 9000

1.2. If the environment variable Environment="OLLAMA_SCHED_SPREAD=1" is not set, all three models will run successfully in sequence, but the inference time is relatively long.

Log file: ollama2.log (https://github.com/user-attachments/files/17446683/ollama2.log)

1.3. After setting Environment="OLLAMA_SCHED_SPREAD=1".

In order to improve the inference efficiency of the three models and to fully utilize four GPU cards for concurrent processing, I set the environment variable Environment="OLLAMA_SCHED_SPREAD=1" as you instructed. However, after multiple tests, the first model runs successfully every time, the second model fails almost every time, and the third model sometimes succeeds and sometimes fails.
Log file: ollama1.log (https://github.com/user-attachments/files/17446687/ollama1.log)

1.4. In ollama1.log, the first model succeeds, and the second and third both fail.

Model 2 API error, as follows:

{"message":"Request failed with status code 500","data":{"error":"llama runner process has terminated: error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'"}}

Model 3 API error, as follows:

{"message":"Request failed with status code 500","data":{"error":"an unknown error was encountered while running the model CUDA error: out of memory\n current device: 3, in function alloc at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:376\n cuMemCreate(&handle, reservesize, &prop, 0)\n/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error"}}

1.5 Analysis of ollama1.log

According to your previous instructions, I went through the log information in ollama1.log and analyzed it. There are some parts I do not understand or may have misunderstood, and I hope for your assistance.

Note: The following analysis and most of the logs are from ollama1.log. Only in sections 4.5.1 and 5.5 do I compare against the relevant information from ollama2.log.

2. The four GPU cards report available resources of 23.3 GiB, 23.3 GiB, 16.8 GiB, and 9.7 GiB, as follows:

October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.629+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]"

3. First Model: llama3.1:8b

3.1 With the default OLLAMA_NUM_PARALLEL=4, the estimated requirements are partial_offload="32.3 GiB" and full_offload="32.3 GiB", and no single GPU card can meet them, as follows:

October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.629+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="2.0 GiB" gpu_zer_overhead="0 B" partial_offload="32.3 GiB" full_offload="32.3 GiB"
October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.629+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="2.0 GiB" gpu_zer_overhead="0 B" partial_offload="32.3 GiB" full_offload="32.3 GiB"
October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.629+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="2.0 GiB" gpu_zer_overhead="0 B" partial_offload="32.3 GiB" full_offload="32.3 GiB"
October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.629+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="2.0 GiB" gpu_zer_overhead="0 B" partial_offload="32.3 GiB" full_offload="32.3 GiB"
October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.630+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"

3.2 The following log shows that ollama automatically sets parallel=1 and that the required resources are required="55.3 GiB". I do not understand why the requirement is now 55.3 GiB, which is larger than the 32.3 GiB shown in 3.1. Also, does parallel=1 in this log mean that the four concurrent slots have been reduced to one, and if so, why does reducing to parallel=1 require more resources? The log is as follows:

October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.630+08:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe library=cuda parallel=1 required="55.3 GiB"

3.3 The model runs on four GPU cards, as follows.

Question: When Environment="OLLAMA_SCHED_SPREAD=1" is not set, why does this model still run on four GPU cards (not what I expected), while the other two models each run on only one GPU card (as expected)? For this part, see the log file ollama2.log.

October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.632+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split=10,11,11,1 memory.available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="55.3 GiB" memory.required.partial="55.3 GiB" memory.required.kv="14.8 GiB" memory.required.allocations="[14.9 GiB 15.5 GiB 15.3 GiB 9.7 GiB]" memory.weights.total="18.4 GiB" memory.weights.repeating="18.0 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="8.1 GiB" memory.graph.partial="8.1 GiB"

3.4 The first model works normally, as follows:

October 19 22:19:34 gpu ollama[60399]: DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=2 tid="139958483587072" timestamp=1729347574
October 19 22:19:34 gpu ollama[60399]: DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=58174 slot_id=0 task_id=2 tid="139958483587072" timestamp=1729347574
October 19 22:19:34 gpu ollama[60399]: DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=2 tid="139958483587072" timestamp=1729347574
October 19 22:21:05 gpu ollama[60399]: DEBUG [print_timings] prompt eval time     =   44130.26 ms / 58174 tokens (    0.76 ms per token,  1318.23 tokens per second) | n_prompt_tokens_processed=58174 n_tokens_second=1318.2337027993692 slot_id=0 t_prompt_processing=44130.263 t_token=0.7585908309554096 task_id=2 tid="139958483587072" timestamp=1729347665
October 19 22:21:05 gpu ollama[60399]: DEBUG [print_timings] generation eval time =   47231.63 ms /   640 runs   (   73.80 ms per token,    13.55 tokens per second) | n_decoded=640 n_tokens_second=13.55024251017226 slot_id=0 t_token=73.7994171875 t_token_generation=47231.627 task_id=2 tid="139958483587072" timestamp=1729347665
October 19 22:21:05 gpu ollama[60399]: DEBUG [print_timings]           total time =   91361.89 ms | slot_id=0 t_prompt_processing=44130.263 t_token_generation=47231.627 t_total=91361.89 task_id=2 tid="139958483587072" timestamp=1729347665
October 19 22:21:05 gpu ollama[60399]: DEBUG [update_slots] slot released | n_cache_tokens=58814 n_ctx=121024 n_past=58813 n_system_tokens=0 slot_id=0 task_id=2 tid="139958483587072" timestamp=1729347665 truncated=false
October 19 22:21:05 gpu ollama[60399]: DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=55272 status=200 tid="139956475393792" timestamp=1729347665
October 19 22:21:05 gpu ollama[60399]: DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=645 tid="139958483587072" timestamp=1729347665
October 19 22:21:06 gpu ollama[60399]: DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=55274 status=200 tid="139956467001088" timestamp=1729347666
October 19 22:21:06 gpu ollama[60399]: [GIN] 2024/10/19 - 22:21:06 | 200 |         1m38s |    172.16.1.219 | POST     "/api/generate"

4. Second Model: glm4:9b

4.1 After the first model ends, perhaps because its GPU resources have not yet been completely released (I am not sure if this is the reason), the available memory on each GPU cannot meet the model's requirements of partial_offload="30.9 GiB" and full_offload="30.9 GiB", as follows:

10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.897+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[8.8 GiB 7.1 GiB 6.5 GiB 16.1 MiB]"
10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.898+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="8.8 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.898+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="7.1 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.898+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="6.5 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.898+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.1 MiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.898+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"

4.2 Next, I noticed "resetting model to expire immediately to make room":

October 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.899+08:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0

4.3 Then, I saw the llama server stop and the GPU resources come back (consistent with the initially available amounts).

October 19 22:21:09 gpu ollama[60399]: time=2024-10-19T22:21:09.039+08:00 level=DEBUG source=server.go:1044 msg="stopping llama server"
October 19 22:21:09 gpu ollama[60399]: time=2024-10-19T22:21:09.039+08:00 level=DEBUG source=server.go:1050 msg="waiting for llama server to exit"
October 19 22:21:09 gpu ollama[60399]: time=2024-10-19T22:21:09.290+08:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="125.4 GiB" before.free="108.2 GiB" before.free_swap="3.7 GiB" now.total="125.4 GiB" now.free="108.5 GiB" now.free_swap="3.7 GiB"
October 19 22:21:09 gpu ollama[60399]: CUDA driver version: 12.2
October 19 22:21:10 gpu ollama[60399]: time=2024-10-19T22:21:10.002+08:00 level=DEBUG source=server.go:1054 msg="llama server stopped"
October 19 22:21:10 gpu ollama[60399]: time=2024-10-19T22:21:10.003+08:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
October 19 22:21:10 gpu ollama[60399]: time=2024-10-19T22:21:10.220+08:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce name="NVIDIA A30" overhead="0 B" before.total="23.5 GiB" before.free="8.8 GiB" now.total="23.5 GiB" now.free="9.7 GiB" now.used="13.8 GiB"
October 19 22:21:10 gpu ollama[60399]: time=2024-10-19T22:21:10.510+08:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 name="NVIDIA A30" overhead="0 B" before.total="23.5 GiB" before.free="16.1 MiB" now.total="23.5 GiB" now.free="16.8 GiB" now.used="6.7 GiB"
October 19 22:21:10 gpu ollama[60399]: time=2024-10-19T22:21:10.778+08:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 name="NVIDIA A30" overhead="0 B" before.total="23.5 GiB" before.free="7.1 GiB" now.total="23.5 GiB" now.free="23.3 GiB" now.used="232.9 MiB"
October 19 22:21:11 gpu ollama[60399]: time=2024-10-19T22:21:11.017+08:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 name="NVIDIA A30" overhead="0 B" before.total="23.5 GiB" before.free="6.5 GiB" now.total="23.5 GiB" now.free="23.3 GiB" now.used="232.9 MiB"

4.4 However, the available resources of each GPU card still cannot meet the needs of the second model.

Question: Is OLLAMA_NUM_PARALLEL still set to 4 at this time?

10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.112+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]"
10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.113+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.113+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.113+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.113+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.113+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"

4.5 The model then runs across four GPU cards; the available resources memory.available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" can cover memory.required.full="43.1 GiB", as shown below.

Question: OLLAMA_NUM_PARALLEL is still 4 at this point; why doesn't ollama automatically drop to parallel=1 as it did for the first model, and then check whether any single GPU card can meet the resources required at parallel=1?

October 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.114+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=41 layers.split=12,12,12,5 memory.available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="43.1 GiB" memory.required.partial="43.1 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[11.1 GiB 11.3 GiB 11.1 GiB 9.5 GiB]" memory.weights.total="8.9 GiB" memory.weights.repeating="8.4 GiB" memory.weights.nonrepeating="485.6 MiB" memory.graph.full="7.8 GiB" memory.graph.partial="7.8 GiB"

4.5.1 The following lines are from ollama2.log. When OLLAMA_SCHED_SPREAD=1 is not set, this model eventually runs on a single GPU card and succeeds (different from 4.5 and 4.6 above).

10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.306+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.5 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.5 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[16.8 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.5 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[9.7 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.5 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 gpu=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 parallel=1 available=24986779648 required="17.7 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.8 GiB" free_swap="3.7 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="17.7 GiB" memory.required.partial="17.7 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[17.7 GiB]" memory.weights.total="8.9 GiB" memory.weights.repeating="8.4 GiB" memory.weights.nonrepeating="485.6 MiB" memory.graph.full="7.6 GiB" memory.graph.partial="7.8 GiB"

4.6 Next, I saw the OOM error message in the logs.

I don't understand why these errors occurred: "allocating 8574.52 MiB on device 3: cudaMalloc failed: out of memory". From the logs above, the available memory on device 3 seems like it should be sufficient.

10月 19 22:21:15 gpu ollama[60399]: llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
10月 19 22:21:15 gpu ollama[60399]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8574.52 MiB on device 3: cudaMalloc failed: out of memory
10月 19 22:21:15 gpu ollama[60399]: ggml_gallocr_reserve_n: failed to allocate CUDA3 buffer of size 8991031296
10月 19 22:21:15 gpu ollama[60399]: llama_new_context_with_model: failed to allocate compute buffers
10月 19 22:21:16 gpu ollama[60399]: llama_init_from_gpt_params: error: failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'
10月 19 22:21:16 gpu ollama[60399]: ERROR [load_model] unable to load model | model="/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449" tid="140035751489536" timestamp=1729347676
10月 19 22:21:16 gpu ollama[60399]: terminate called without an active exception
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.326+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.515+08:00 level=DEBUG source=server.go:428 msg="llama runner terminated" error="signal: aborted"
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.576+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'"
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:458 msg="triggering expiration for failed load" model=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="125.4 GiB" before.free="109.8 GiB" before.free_swap="3.7 GiB" now.total="125.4 GiB" now.free="109.8 GiB" now.free_swap="3.7 GiB"
10月 19 22:21:16 gpu ollama[60399]: [GIN] 2024/10/19 - 22:21:16 | 500 | 10.016638054s |    172.16.1.219 | POST     "/api/generate"

4.7 Checking through the frontend application, the API returned the following error message (presumably caused by the errors above):

{"message":"Request failed with status code 500","data":{"error":"llama runner process has terminated: error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'"}}

5. Third Model: llama3.2:latest

5.1 At this point, the resources of the four GPU cards have been completely released. With the default OLLAMA_NUM_PARALLEL=4, the estimated requirements are partial_offload="24.9 GiB" and full_offload="24.9 GiB", and no single GPU card can meet them, as follows:

10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.064+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"

5.2 The following log shows parallel=1, and the model runs on four GPU cards.

Question 1: Why does parallel=1 require more resources (memory.required.full="43.7 GiB") when only 24.9 GiB was needed in 5.1?
Question 2: Why do models 1 and 3 automatically downgrade from parallel=4 to parallel=1, while the second model does not adjust to parallel=1?

10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff library=cuda parallel=1 required="43.7 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.7 GiB" free_swap="3.7 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.066+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split=8,9,8,4 memory.available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="43.7 GiB" memory.required.partial="43.7 GiB" memory.required.kv="12.9 GiB" memory.required.allocations="[11.4 GiB 11.7 GiB 11.4 GiB 9.3 GiB]" memory.weights.total="14.5 GiB" memory.weights.repeating="14.2 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="6.2 GiB" memory.graph.partial="6.2 GiB"

5.3 According to the analysis above, the available resources of the four GPU cards (23.3 GiB, 23.3 GiB, 16.8 GiB, 9.7 GiB) can cover the model's requirement of 43.7 GiB, so why does the third model suddenly report an OOM error?

10月 19 22:21:46 gpu ollama[60399]: CUDA error: out of memory

and

10月 19 22:21:47 gpu ollama[60399]: No symbol table is loaded.  Use the "file" command.
10月 19 22:21:47 gpu ollama[60399]: [Inferior 1 (process 60845) detached]
10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.580+08:00 level=DEBUG source=server.go:1044 msg="stopping llama server"
10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.581+08:00 level=DEBUG source=server.go:1050 msg="waiting for llama server to exit"
10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.908+08:00 level=DEBUG source=server.go:428 msg="llama runner terminated" error="signal: aborted"
10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.908+08:00 level=DEBUG source=server.go:1054 msg="llama server stopped"
10月 19 22:21:47 gpu ollama[60399]: [GIN] 2024/10/19 - 22:21:47 | 500 | 30.138749435s |    172.16.1.219 | POST     "/api/generate"

5.4 Checking through the frontend application, the API returned the following error message:

{"message":"Request failed with status code 500","data":{"error":"an unknown error was encountered while running the model CUDA error: out of memory\n current device: 3, in function alloc at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:376\n cuMemCreate(&handle, reservesize, &prop, 0)\n/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error"}}

5.5 The following log is from ollama2.log. Here the model eventually runs on a single GPU card and succeeds (different from 5.2 and 5.3 above).

10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.192+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[16.8 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.197+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.198+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.198+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[9.7 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[4.3 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="4.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 parallel=1 available=24986779648 required="21.6 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=DEBUG source=sched.go:249 msg="new model fits with existing models, loading"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.1 GiB" free_swap="3.7 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.203+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.6 GiB" memory.required.partial="21.6 GiB" memory.required.kv="12.9 GiB" memory.required.allocations="[21.6 GiB]" memory.weights.total="14.5 GiB" memory.weights.repeating="14.2 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="5.8 GiB" memory.graph.partial="6.2 GiB"

6. I monitor GPU activity by running the command gpustat -i 1 and observe the following:

6.1 If the environment variable is set as Environment="OLLAMA_SCHED_SPREAD=1" (as shown in ollama1.log, model 1 runs successfully, while models 2 and 3 fail)

All three models run on GPUs 1, 2, and 3 (which is basically as expected), but I do not know why GPU 0 has not been used or is only occasionally occupied.

6.2 If the environment variable is not set as Environment="OLLAMA_SCHED_SPREAD=1" (as shown in ollama2.log, all three models run successfully)

First model: It runs on GPUs 1, 2, and 3. I do not know why GPU 1 has not been used or is only occasionally occupied, and I am unclear why this model can still run on multiple GPU cards without setting the environment variable Environment="OLLAMA_SCHED_SPREAD=1".
Second model: It runs only on GPU 2 (as expected).
Third model: It runs only on GPU 3 (as expected).
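
As a side note, instead of watching gpustat -i 1 by hand, per-GPU memory can also be polled and logged alongside the requests, for example with the nvidia-ml-py (pynvml) bindings. This is only a suggestion for capturing the numbers, not something used in the tests above.

import time
import pynvml  # from the nvidia-ml-py package

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        # Print used VRAM per card once per second, similar to gpustat -i 1.
        used = [pynvml.nvmlDeviceGetMemoryInfo(h).used / 2**30 for h in handles]
        print(" | ".join(f"GPU{i}: {u:.1f} GiB used" for i, u in enumerate(used)))
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()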

ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.8 GiB" free_swap="3.7 GiB" 10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]" 10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="17.7 GiB" memory.required.partial="17.7 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[17.7 GiB]" memory.weights.total="8.9 GiB" memory.weights.repeating="8.4 GiB" memory.weights.nonrepeating="485.6 MiB" memory.graph.full="7.6 GiB" memory.graph.partial="7.8 GiB" ``` ### 4.6 Next, I saw the OOM error message in the logs. I don't understand why these errors occurred, allocating 8574.52 MiB on device 3: cudaMalloc failed: out of memory. From the above logs, it seems that the available resources on device 3 should be sufficient. ``` 10月 19 22:21:15 gpu ollama[60399]: llama_new_context_with_model: pipeline parallelism enabled (n_copies=4) 10月 19 22:21:15 gpu ollama[60399]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8574.52 MiB on device 3: cudaMalloc failed: out of memory 10月 19 22:21:15 gpu ollama[60399]: ggml_gallocr_reserve_n: failed to allocate CUDA3 buffer of size 8991031296 10月 19 22:21:15 gpu ollama[60399]: llama_new_context_with_model: failed to allocate compute buffers 10月 19 22:21:16 gpu ollama[60399]: llama_init_from_gpt_params: error: failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449' 10月 19 22:21:16 gpu ollama[60399]: ERROR [load_model] unable to load model | model="/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449" tid="140035751489536" timestamp=1729347676 10月 19 22:21:16 gpu ollama[60399]: terminate called without an active exception 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.326+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error" 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.515+08:00 level=DEBUG source=server.go:428 msg="llama runner terminated" error="signal: aborted" 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.576+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'" 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:458 msg="triggering expiration for failed load" model=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:375 msg="got lock to unload" 
modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="125.4 GiB" before.free="109.8 GiB" before.free_swap="3.7 GiB" now.total="125.4 GiB" now.free="109.8 GiB" now.free_swap="3.7 GiB" 10月 19 22:21:16 gpu ollama[60399]: [GIN] 2024/10/19 - 22:21:16 | 500 | 10.016638054s | 172.16.1.219 | POST "/api/generate" ``` ### 4.7 Upon checking the logs through the frontend application, the API interface returned the following error message (this should be due to the errors mentioned above): ``` {"message":"Request failed with status code 500","data":{"error":"llama runner process has terminated: error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'"} ``` ## 5. Third Model: llama3.2:latest ### 5.1 At this point, the resources of the four GPU cards have been completely released. By default, OLLAMA_NUM_PARALLEL=4, and the required resources are partial_offload="24.9 GiB" full_offload="24.9 GiB", but none of the individual GPU cards can meet the required resources. As follows, ``` 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.064+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" ``` ### 5.2 The following log shows parallel=1, and the model runs on four GPU cards. 
Question 1: Why does parallel=1 require more resources, 43.7G, memory.required.full="43.7 GiB", while only 24.9G was needed in 5.1? Question 2: Why do models 1 and 3 automatically downgrade from parallel=4 to parallel=1, but the second model does not automatically adjust to parallel=1? ``` 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff library=cuda parallel=1 required="43.7 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.7 GiB" free_swap="3.7 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.066+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split=8,9,8,4 memory.available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="43.7 GiB" memory.required.partial="43.7 GiB" memory.required.kv="12.9 GiB" memory.required.allocations="[11.4 GiB 11.7 GiB 11.4 GiB 9.3 GiB]" memory.weights.total="14.5 GiB" memory.weights.repeating="14.2 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="6.2 GiB" memory.graph.partial="6.2 GiB" ``` ### 5.3 According to the above analysis, the available resources of the four GPU cards (23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB) can meet the model's resource requirement of 43.7G, why does the third model suddenly report an OOM error. ``` 10月 19 22:21:46 gpu ollama[60399]: CUDA error: out of memory ``` and ``` 10月 19 22:21:47 gpu ollama[60399]: No symbol table is loaded. Use the "file" command. 
10月 19 22:21:47 gpu ollama[60399]: [Inferior 1 (process 60845) detached] 10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.580+08:00 level=DEBUG source=server.go:1044 msg="stopping llama server" 10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.581+08:00 level=DEBUG source=server.go:1050 msg="waiting for llama server to exit" 10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.908+08:00 level=DEBUG source=server.go:428 msg="llama runner terminated" error="signal: aborted" 10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.908+08:00 level=DEBUG source=server.go:1054 msg="llama server stopped" 10月 19 22:21:47 gpu ollama[60399]: [GIN] 2024/10/19 - 22:21:47 | 500 | 30.138749435s | 172.16.1.219 | POST "/api/generate" ``` ### 5.4 Upon checking the logs through the frontend application, the API interface returned the following error message: ``` {"message":"Request failed with status code 500","data":{"error":"an unknown error was encountered while running the model CUDA error: out of memory\n current device: 3, in function alloc at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:376\n cuMemCreate(&handle, reservesize, &prop, 0)\n/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error"} ``` ### 5.5 The following log is from ollama2.log, and it can be observed that the model eventually runs on a single GPU card and ultimately succeeds. (Different from 5.2 and 5.3 above) ``` 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.192+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[16.8 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.197+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.198+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.198+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[9.7 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB" 10月 
20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[4.3 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="4.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 parallel=1 available=24986779648 required="21.6 GiB" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=DEBUG source=sched.go:249 msg="new model fits with existing models, loading" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.1 GiB" free_swap="3.7 GiB" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.203+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.6 GiB" memory.required.partial="21.6 GiB" memory.required.kv="12.9 GiB" memory.required.allocations="[21.6 GiB]" memory.weights.total="14.5 GiB" memory.weights.repeating="14.2 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="5.8 GiB" memory.graph.partial="6.2 GiB" ``` ## 6 I monitor the GPU operation by executing the command ```gpustat -i 1```, as follows: ### 6.1 If the environment variable is set as Environment="OLLAMA_SCHED_SPREAD=1" (as shown in ollama1.log, model 1 runs successfully, while models 2 and 3 fail) All three models run on GPUs 1, 2, and 3 (which is basically as expected), but I do not know why GPU 0 has not been used or is only occasionally occupied. ### 6.2 If the environment variable is not set as Environment="OLLAMA_SCHED_SPREAD=1" (as shown in ollama2.log, all three models run successfully) First model: It runs on GPUs 1, 2, and 3. I do not know why GPU 1 has not been used or is only occasionally occupied, and I am unclear why this model can still run on multiple GPU cards without setting the environment variable Environment="OLLAMA_SCHED_SPREAD=1". Second model: It runs only on GPU 2 (as expected). Third model: It runs only on GPU 3 (as expected).
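For requirement 1 above (releasing GPU memory quickly so the next model does not hit an OOM), one option is to ask the scheduler to drop each model as soon as its request finishes, instead of waiting for the default 5-minute keep-alive. A minimal sketch, assuming the server is on the default port; the model name and prompt are only illustrative:

```
# Ask ollama to unload this model as soon as the response is complete,
# so the next model in the sequence starts against (mostly) freed VRAM.
curl http://localhost:11434/api/generate -d '{
  "model": "glm4:9b",
  "prompt": "Summarize the following text: ...",
  "keep_alive": 0
}'

# An explicit unload can also be requested with no prompt:
curl http://localhost:11434/api/generate -d '{"model": "glm4:9b", "keep_alive": 0}'
```

Even then, the driver can take a short while to report the memory as free again (compare the gap between 4.3 and 4.4), so a brief pause or retry before loading the next model may still be needed.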

@dhiltgen commented on GitHub (Oct 22, 2024):

The reason we don't default to OLLAMA_SCHED_SPREAD=1 is because most users see slower performance due to CPU bottlenecks when a model could load in a single GPU. Have you analyzed the performance to determine that OLLAMA_SCHED_SPREAD=1 does actually increase performance in your setup?
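For reference, one way to check this is to run the same prompt against the same model under both configurations and compare the reported token rates; a rough sketch (model name and prompt are placeholders):

```
# Baseline: spreading disabled (default). The --verbose flag prints timing
# statistics, including the "eval rate" in tokens per second.
ollama run glm4:9b --verbose "Summarize this paragraph: ..."

# Then restart the service with OLLAMA_SCHED_SPREAD=1 set and repeat the
# same prompt; the eval_count/eval_duration fields in the /api/generate
# response can be used for the same comparison when scripting it.
```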


@goactiongo commented on GitHub (Oct 23, 2024):

Because setting OLLAMA_SCHED_SPREAD=1 meant GPU resources were not released in a timely manner, causing other requests to fail for lack of GPU memory for a period of time, I have removed this setting.

Furthermore, following the guidance of @rick-github, I have set the following environment variables: OLLAMA_NUM_PARALLEL=1, OLLAMA_FLASH_ATTENTION=1, and GGML_CUDA_ENABLE_UNIFIED_MEMORY=1.
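
(For completeness: on a systemd-based install, these variables are typically applied the same way the Environment="OLLAMA_SCHED_SPREAD=1" line in section 6 above was, e.g. via a service override. The path below is the conventional drop-in location and may differ on your system.)

```
# /etc/systemd/system/ollama.service.d/override.conf
# (illustrative path, e.g. created with `systemctl edit ollama`)
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
```

Followed by `systemctl daemon-reload` and `systemctl restart ollama` for the change to take effect.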

From @dhiltgen's reply:

> Our current scheduling algorithm does have some difficulty dealing with GPUs that have very different VRAM sizes. I believe that coupled with under-estimating VRAM requirements for large context size is likely leading us to try to put too many layers on the smallest GPU when there's ample room on the larger GPU, which also explains why turning spread on causes this problem to get worse.


@dhiltgen commented on GitHub (Oct 30, 2024):

@SDAIer it sounds like you have a working setup now, is that correct?


@goactiongo commented on GitHub (Oct 31, 2024):

What is your question?


@rick-github commented on GitHub (Oct 31, 2024):

@SDAIer Can this issue be closed?


@goactiongo commented on GitHub (Nov 1, 2024):

OK, I will close this issue. Thanks guys @rick-github @dhiltgen

Reference: github-starred/ollama#4609