[GH-ISSUE #7253] The issue regarding concurrent processing with multiple GPU cards #4609

Closed
opened 2026-04-12 15:31:40 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @goactiongo on GitHub (Oct 18, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7253

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Premise:

The Linux server has 4 GPU cards, and OLLAMA_SCHED_SPREAD=1 is set with the aim of improving the models' inference efficiency through concurrent processing on multiple GPU cards.

My Scenario:

Within the same process, I call 3 different LLM models in sequence on the same task (for example, summarizing a long text), so that users can see the summaries produced by the 3 different models and compare their results.

After the process runs, each model can be observed running on multiple GPU cards, but the following issues arise:

  1. After the first model finishes running, the second model reports an OOM error, and the third model sometimes succeeds and sometimes fails.
  2. Is this because the GPU resources are not fully released after the first model finishes, so the second model fails for lack of GPU memory, while the third model succeeds if the resources have been released by then and fails if they have not?
  3. If OLLAMA_SCHED_SPREAD=1 is not set, all three models run successfully because ollama handles each model's requests on a different GPU card, but this approach is slower because each model then runs on a single GPU card.

My requirements are as follows:

  1. With OLLAMA_SCHED_SPREAD=1 set, how can GPU resources be released quickly after the first model finishes running, so that subsequent models do not fail due to insufficient GPU resources? (A sketch of such an unload request follows this list.)
  2. If requirement 1 cannot be met, what other methods can improve model inference efficiency through concurrent processing on multiple GPU cards?
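
For illustration only, here is a minimal sketch of a request that asks the server to unload the model immediately after it responds, using the keep_alive field of the /api/generate API. The endpoint URL, model name, and prompt are placeholders, and whether this is enough to avoid the OOM behaviour described above is exactly what this issue is asking about.

import requests

# Placeholder endpoint and model; adjust to the actual deployment.
OLLAMA_URL = "http://localhost:11434/api/generate"

resp = requests.post(OLLAMA_URL, json={
    "model": "llama3.1:8b",
    "prompt": "Summarize the following text: ...",
    "stream": False,
    # keep_alive=0 asks the server to unload this model right after the
    # response instead of keeping it resident for the default 5 minutes.
    "keep_alive": 0,
})
resp.raise_for_status()
print(resp.json()["response"])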

Other Questions (without setting OLLAMA_SCHED_SPREAD=1):

Ollama defaults to OLLAMA_NUM_PARALLEL=4. If a single GPU card cannot supply the resources for 4 parallel slots, ollama automatically falls back to PARALLEL=1. At that point, if a single GPU card can supply the resources needed for PARALLEL=1, inference runs on that one card; if a single GPU card cannot, ollama automatically uses all 4 GPU cards. Is this the automatic, default mechanism?
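
To make sure I am describing the behaviour clearly, here is the fallback I mean, paraphrased in code form only; this is not Ollama's actual scheduler logic, and the vram_needed callback is hypothetical.

def pick_layout(free_vram_per_gpu, vram_needed, requested_parallel=4):
    """Paraphrase of the fallback described above; not Ollama's real scheduler.

    free_vram_per_gpu: list of free VRAM per card, in GiB.
    vram_needed(parallel): hypothetical estimate of the VRAM the model needs
    at a given parallel setting, in GiB.
    """
    # Try the default OLLAMA_NUM_PARALLEL=4 on a single card first.
    for i, free in enumerate(free_vram_per_gpu):
        if free >= vram_needed(requested_parallel):
            return f"single GPU {i}", requested_parallel
    # Otherwise fall back to parallel=1 and try a single card again.
    for i, free in enumerate(free_vram_per_gpu):
        if free >= vram_needed(1):
            return f"single GPU {i}", 1
    # If even parallel=1 does not fit on one card, spread across all cards.
    return "spread across all GPUs", 1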

Summary:

The overall requirement is to improve the efficiency of concurrent inference when multiple GPU cards are available, and thereby improve the user experience.

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.14

GiteaMirror added the bug label 2026-04-12 15:31:40 -05:00
Author
Owner

@rick-github commented on GitHub (Oct 18, 2024):

When you get an OOM error, is all VRAM allocated? Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) and the output of nvidia-smi will aid in debugging.

Author
Owner

@goactiongo commented on GitHub (Oct 19, 2024):

Hi @rick-github
I have analyzed the relevant logs and attached the log files to this issue; I also copied them into issue #7146, so please close whichever one is redundant.

Thank you for your reply. I have conducted the following two tests, and I am unsure how to handle some issues, so I need your further assistance.

1. Overall Situation Description

1.1 In the same process, three models are called in sequence through the API to handle the content summarization task. The models are llama3.1:8b, glm4:9b, and llama3.2:latest, and each call sets the following parameters:

"num_ctx": 121000,
"num_predict": 9000

1.2. If the environment variable Environment="OLLAMA_SCHED_SPREAD=1" is not set, all three models will run successfully in sequence, but the inference time is relatively long.

Log file: ollama2.log (https://github.com/user-attachments/files/17446683/ollama2.log)

1.3. After setting Environment="OLLAMA_SCHED_SPREAD=1".

In order to improve the inference efficiency of the three models and to fully utilize four GPU cards for concurrent processing, I set the environment variable Environment="OLLAMA_SCHED_SPREAD=1" as you instructed. However, after multiple tests, the first model runs successfully every time, the second model fails almost every time, and the third model sometimes succeeds and sometimes fails.
Log file: ollama1.log (https://github.com/user-attachments/files/17446687/ollama1.log)

1.4. In ollama1.log, the first model succeeds, and the second and third both fail.

Model 2 API error, as follows:

{"message":"Request failed with status code 500","data":{"error":"llama runner process has terminated: error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'"}}

Model 3 API error, as follows:

{"message":"Request failed with status code 500","data":{"error":"an unknown error was encountered while running the model CUDA error: out of memory\n current device: 3, in function alloc at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:376\n cuMemCreate(&handle, reservesize, &prop, 0)\n/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error"}}

1.5 Analysis of ollama1.log

According to your previous instructions, I went through the log information in ollama1.log and analyzed it. There are some parts I do not understand or may have misunderstood, and I hope for your assistance.

Note: The following analysis and most of the logs are from ollama1.log. Only in sections 4.5.1 and 5.5 do I compare against the relevant information from ollama2.log.

2. The four GPU cards report available resources of 23.3 GiB, 23.3 GiB, 16.8 GiB, and 9.7 GiB, as follows:

October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.629+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]"

3. First Model: llama3.1:8b

3.1 With the default OLLAMA_NUM_PARALLEL=4, the estimated requirements are partial_offload="32.3 GiB" and full_offload="32.3 GiB", and no single GPU card can meet them, as follows:

October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.629+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="2.0 GiB" gpu_zer_overhead="0 B" partial_offload="32.3 GiB" full_offload="32.3 GiB"
October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.629+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="2.0 GiB" gpu_zer_overhead="0 B" partial_offload="32.3 GiB" full_offload="32.3 GiB"
October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.629+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="2.0 GiB" gpu_zer_overhead="0 B" partial_offload="32.3 GiB" full_offload="32.3 GiB"
October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.629+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="2.0 GiB" gpu_zer_overhead="0 B" partial_offload="32.3 GiB" full_offload="32.3 GiB"
October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.630+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"

3.2 The following log shows that ollama automatically sets parallel=1 and that the required resources are required="55.3 GiB". I do not understand why the requirement is now 55.3 GiB, which is larger than the 32.3 GiB shown in 3.1. Also, does parallel=1 in this log mean that the four concurrent slots have been reduced to one, and if so, why does reducing to parallel=1 require more resources? The log is as follows:

October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.630+08:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe library=cuda parallel=1 required="55.3 GiB"

3.3 The model runs on four GPU cards, as follows.

Question: When Environment="OLLAMA_SCHED_SPREAD=1" is not set, why does this model still run on four GPU cards (not what I expected), while the other two models each run on only one GPU card (as expected)? For this part, see the log file ollama2.log.

October 19 22:19:29 gpu ollama[60399]: time=2024-10-19T22:19:29.632+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split=10,11,11,1 memory.available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="55.3 GiB" memory.required.partial="55.3 GiB" memory.required.kv="14.8 GiB" memory.required.allocations="[14.9 GiB 15.5 GiB 15.3 GiB 9.7 GiB]" memory.weights.total="18.4 GiB" memory.weights.repeating="18.0 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="8.1 GiB" memory.graph.partial="8.1 GiB"

3.4 The first model works normally, as follows:

October 19 22:19:34 gpu ollama[60399]: DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=2 tid="139958483587072" timestamp=1729347574
October 19 22:19:34 gpu ollama[60399]: DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=58174 slot_id=0 task_id=2 tid="139958483587072" timestamp=1729347574
October 19 22:19:34 gpu ollama[60399]: DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=2 tid="139958483587072" timestamp=1729347574
October 19 22:21:05 gpu ollama[60399]: DEBUG [print_timings] prompt eval time     =   44130.26 ms / 58174 tokens (    0.76 ms per token,  1318.23 tokens per second) | n_prompt_tokens_processed=58174 n_tokens_second=1318.2337027993692 slot_id=0 t_prompt_processing=44130.263 t_token=0.7585908309554096 task_id=2 tid="139958483587072" timestamp=1729347665
October 19 22:21:05 gpu ollama[60399]: DEBUG [print_timings] generation eval time =   47231.63 ms /   640 runs   (   73.80 ms per token,    13.55 tokens per second) | n_decoded=640 n_tokens_second=13.55024251017226 slot_id=0 t_token=73.7994171875 t_token_generation=47231.627 task_id=2 tid="139958483587072" timestamp=1729347665
October 19 22:21:05 gpu ollama[60399]: DEBUG [print_timings]           total time =   91361.89 ms | slot_id=0 t_prompt_processing=44130.263 t_token_generation=47231.627 t_total=91361.89 task_id=2 tid="139958483587072" timestamp=1729347665
October 19 22:21:05 gpu ollama[60399]: DEBUG [update_slots] slot released | n_cache_tokens=58814 n_ctx=121024 n_past=58813 n_system_tokens=0 slot_id=0 task_id=2 tid="139958483587072" timestamp=1729347665 truncated=false
October 19 22:21:05 gpu ollama[60399]: DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=55272 status=200 tid="139956475393792" timestamp=1729347665
October 19 22:21:05 gpu ollama[60399]: DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=645 tid="139958483587072" timestamp=1729347665
October 19 22:21:06 gpu ollama[60399]: DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=55274 status=200 tid="139956467001088" timestamp=1729347666
October 19 22:21:06 gpu ollama[60399]: [GIN] 2024/10/19 - 22:21:06 | 200 |         1m38s |    172.16.1.219 | POST     "/api/generate"

4. Second Model: glm4:9b

4.1 After the first model ends, perhaps because its GPU resources have not yet been completely released (I am not sure if this is the reason), the available memory on each GPU cannot meet the model's requirements of partial_offload="30.9 GiB" and full_offload="30.9 GiB", as follows:

10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.897+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[8.8 GiB 7.1 GiB 6.5 GiB 16.1 MiB]"
10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.898+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="8.8 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.898+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="7.1 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.898+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="6.5 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.898+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.1 MiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.898+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"

4.2 Next, I noticed "resetting model to expire immediately to make room":

October 19 22:21:07 gpu ollama[60399]: time=2024-10-19T22:21:07.899+08:00 level=DEBUG source=sched.go:283 msg="resetting model to expire immediately to make room" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0

4.3 Then, I saw the llama server stop and the GPU resources come back (consistent with the initially available amounts).

October 19 22:21:09 gpu ollama[60399]: time=2024-10-19T22:21:09.039+08:00 level=DEBUG source=server.go:1044 msg="stopping llama server"
October 19 22:21:09 gpu ollama[60399]: time=2024-10-19T22:21:09.039+08:00 level=DEBUG source=server.go:1050 msg="waiting for llama server to exit"
October 19 22:21:09 gpu ollama[60399]: time=2024-10-19T22:21:09.290+08:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="125.4 GiB" before.free="108.2 GiB" before.free_swap="3.7 GiB" now.total="125.4 GiB" now.free="108.5 GiB" now.free_swap="3.7 GiB"
October 19 22:21:09 gpu ollama[60399]: CUDA driver version: 12.2
October 19 22:21:10 gpu ollama[60399]: time=2024-10-19T22:21:10.002+08:00 level=DEBUG source=server.go:1054 msg="llama server stopped"
October 19 22:21:10 gpu ollama[60399]: time=2024-10-19T22:21:10.003+08:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
October 19 22:21:10 gpu ollama[60399]: time=2024-10-19T22:21:10.220+08:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce name="NVIDIA A30" overhead="0 B" before.total="23.5 GiB" before.free="8.8 GiB" now.total="23.5 GiB" now.free="9.7 GiB" now.used="13.8 GiB"
October 19 22:21:10 gpu ollama[60399]: time=2024-10-19T22:21:10.510+08:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 name="NVIDIA A30" overhead="0 B" before.total="23.5 GiB" before.free="16.1 MiB" now.total="23.5 GiB" now.free="16.8 GiB" now.used="6.7 GiB"
October 19 22:21:10 gpu ollama[60399]: time=2024-10-19T22:21:10.778+08:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 name="NVIDIA A30" overhead="0 B" before.total="23.5 GiB" before.free="7.1 GiB" now.total="23.5 GiB" now.free="23.3 GiB" now.used="232.9 MiB"
October 19 22:21:11 gpu ollama[60399]: time=2024-10-19T22:21:11.017+08:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 name="NVIDIA A30" overhead="0 B" before.total="23.5 GiB" before.free="6.5 GiB" now.total="23.5 GiB" now.free="23.3 GiB" now.used="232.9 MiB"

4.4 However, the available resources of each GPU card still cannot meet the needs of the second model.

Question: Is OLLAMA_NUM_PARALLEL still set to 4 at this time?

10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.112+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]"
10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.113+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.113+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.113+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.113+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.9 GiB"
10月 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.113+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"

4.5 The model then runs across four GPU cards; the available resources memory.available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" can cover memory.required.full="43.1 GiB", as shown below.

Question: OLLAMA_NUM_PARALLEL is still 4 at this point; why doesn't ollama automatically drop to parallel=1 as it did for the first model, and then check whether any single GPU card can meet the resources required at parallel=1?

October 19 22:21:12 gpu ollama[60399]: time=2024-10-19T22:21:12.114+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=41 layers.split=12,12,12,5 memory.available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="43.1 GiB" memory.required.partial="43.1 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[11.1 GiB 11.3 GiB 11.1 GiB 9.5 GiB]" memory.weights.total="8.9 GiB" memory.weights.repeating="8.4 GiB" memory.weights.nonrepeating="485.6 MiB" memory.graph.full="7.8 GiB" memory.graph.partial="7.8 GiB"

4.5.1 The following lines are from ollama2.log. When OLLAMA_SCHED_SPREAD=1 is not set, this model eventually runs on a single GPU card and succeeds (different from 4.5 and 4.6 above).

10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.306+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.5 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.5 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.307+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[16.8 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.5 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[9.7 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="582.1 MiB" gpu_zer_overhead="0 B" partial_offload="30.9 GiB" full_offload="30.5 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.308+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 gpu=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 parallel=1 available=24986779648 required="17.7 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.8 GiB" free_swap="3.7 GiB"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="17.7 GiB" memory.required.partial="17.7 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[17.7 GiB]" memory.weights.total="8.9 GiB" memory.weights.repeating="8.4 GiB" memory.weights.nonrepeating="485.6 MiB" memory.graph.full="7.6 GiB" memory.graph.partial="7.8 GiB"

4.6 Next, I saw the OOM error message in the logs.

I don't understand why these errors occurred: "allocating 8574.52 MiB on device 3: cudaMalloc failed: out of memory". From the logs above, the available memory on device 3 seems like it should be sufficient.

10月 19 22:21:15 gpu ollama[60399]: llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
10月 19 22:21:15 gpu ollama[60399]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8574.52 MiB on device 3: cudaMalloc failed: out of memory
10月 19 22:21:15 gpu ollama[60399]: ggml_gallocr_reserve_n: failed to allocate CUDA3 buffer of size 8991031296
10月 19 22:21:15 gpu ollama[60399]: llama_new_context_with_model: failed to allocate compute buffers
10月 19 22:21:16 gpu ollama[60399]: llama_init_from_gpt_params: error: failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'
10月 19 22:21:16 gpu ollama[60399]: ERROR [load_model] unable to load model | model="/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449" tid="140035751489536" timestamp=1729347676
10月 19 22:21:16 gpu ollama[60399]: terminate called without an active exception
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.326+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.515+08:00 level=DEBUG source=server.go:428 msg="llama runner terminated" error="signal: aborted"
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.576+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'"
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:458 msg="triggering expiration for failed load" model=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449
10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="125.4 GiB" before.free="109.8 GiB" before.free_swap="3.7 GiB" now.total="125.4 GiB" now.free="109.8 GiB" now.free_swap="3.7 GiB"
10月 19 22:21:16 gpu ollama[60399]: [GIN] 2024/10/19 - 22:21:16 | 500 | 10.016638054s |    172.16.1.219 | POST     "/api/generate"

4.7 Checking through the frontend application, the API returned the following error message (presumably caused by the errors above):

{"message":"Request failed with status code 500","data":{"error":"llama runner process has terminated: error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'"}}

5. Third Model: llama3.2:latest

5.1 At this point, the resources of the four GPU cards have been completely released. With the default OLLAMA_NUM_PARALLEL=4, the estimated requirements are partial_offload="24.9 GiB" and full_offload="24.9 GiB", and no single GPU card can meet them, as follows:

10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.064+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"

5.2 The following log shows parallel=1, and the model runs on four GPU cards.

Question 1: Why does parallel=1 require more resources (memory.required.full="43.7 GiB") when only 24.9 GiB was needed in 5.1?
Question 2: Why do models 1 and 3 automatically downgrade from parallel=4 to parallel=1, while the second model does not adjust to parallel=1?

10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff library=cuda parallel=1 required="43.7 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.7 GiB" free_swap="3.7 GiB"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]"
10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.066+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split=8,9,8,4 memory.available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="43.7 GiB" memory.required.partial="43.7 GiB" memory.required.kv="12.9 GiB" memory.required.allocations="[11.4 GiB 11.7 GiB 11.4 GiB 9.3 GiB]" memory.weights.total="14.5 GiB" memory.weights.repeating="14.2 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="6.2 GiB" memory.graph.partial="6.2 GiB"

5.3 According to the analysis above, the available resources of the four GPU cards (23.3 GiB, 23.3 GiB, 16.8 GiB, 9.7 GiB) can cover the model's requirement of 43.7 GiB, so why does the third model suddenly report an OOM error?

10月 19 22:21:46 gpu ollama[60399]: CUDA error: out of memory

and

10月 19 22:21:47 gpu ollama[60399]: No symbol table is loaded.  Use the "file" command.
10月 19 22:21:47 gpu ollama[60399]: [Inferior 1 (process 60845) detached]
10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.580+08:00 level=DEBUG source=server.go:1044 msg="stopping llama server"
10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.581+08:00 level=DEBUG source=server.go:1050 msg="waiting for llama server to exit"
10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.908+08:00 level=DEBUG source=server.go:428 msg="llama runner terminated" error="signal: aborted"
10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.908+08:00 level=DEBUG source=server.go:1054 msg="llama server stopped"
10月 19 22:21:47 gpu ollama[60399]: [GIN] 2024/10/19 - 22:21:47 | 500 | 30.138749435s |    172.16.1.219 | POST     "/api/generate"

5.4 Checking through the frontend application, the API returned the following error message:

{"message":"Request failed with status code 500","data":{"error":"an unknown error was encountered while running the model CUDA error: out of memory\n current device: 3, in function alloc at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:376\n cuMemCreate(&handle, reservesize, &prop, 0)\n/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error"}}

5.5 The following log is from ollama2.log. Here the model eventually runs on a single GPU card and succeeds (different from 5.2 and 5.3 above).

10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.192+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[16.8 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.197+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.198+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.198+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[9.7 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[4.3 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="4.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 parallel=1 available=24986779648 required="21.6 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=DEBUG source=sched.go:249 msg="new model fits with existing models, loading"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.1 GiB" free_swap="3.7 GiB"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.203+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.6 GiB" memory.required.partial="21.6 GiB" memory.required.kv="12.9 GiB" memory.required.allocations="[21.6 GiB]" memory.weights.total="14.5 GiB" memory.weights.repeating="14.2 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="5.8 GiB" memory.graph.partial="6.2 GiB"

6. I monitor GPU activity by running the command gpustat -i 1 and observe the following:

6.1 If the environment variable is set as Environment="OLLAMA_SCHED_SPREAD=1" (as shown in ollama1.log, model 1 runs successfully, while models 2 and 3 fail)

All three models run on GPUs 1, 2, and 3 (which is basically as expected), but I do not know why GPU 0 has not been used or is only occasionally occupied.

6.2 If the environment variable is not set as Environment="OLLAMA_SCHED_SPREAD=1" (as shown in ollama2.log, all three models run successfully)

First model: It runs on GPUs 1, 2, and 3. I do not know why GPU 1 has not been used or is only occasionally occupied, and I am unclear why this model can still run on multiple GPU cards without setting the environment variable Environment="OLLAMA_SCHED_SPREAD=1".
Second model: It runs only on GPU 2 (as expected).
Third model: It runs only on GPU 3 (as expected).
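
As a side note, instead of watching gpustat -i 1 by hand, per-GPU memory can also be polled and logged alongside the requests, for example with the nvidia-ml-py (pynvml) bindings. This is only a suggestion for capturing the numbers, not something used in the tests above.

import time
import pynvml  # from the nvidia-ml-py package

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        # Print used VRAM per card once per second, similar to gpustat -i 1.
        used = [pynvml.nvmlDeviceGetMemoryInfo(h).used / 2**30 for h in handles]
        print(" | ".join(f"GPU{i}: {u:.1f} GiB used" for i, u in enumerate(used)))
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()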

ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.8 GiB" free_swap="3.7 GiB" 10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]" 10月 20 00:12:25 gpu ollama[1265]: time=2024-10-20T00:12:25.309+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="17.7 GiB" memory.required.partial="17.7 GiB" memory.required.kv="4.6 GiB" memory.required.allocations="[17.7 GiB]" memory.weights.total="8.9 GiB" memory.weights.repeating="8.4 GiB" memory.weights.nonrepeating="485.6 MiB" memory.graph.full="7.6 GiB" memory.graph.partial="7.8 GiB" ``` ### 4.6 Next, I saw the OOM error message in the logs. I don't understand why these errors occurred, allocating 8574.52 MiB on device 3: cudaMalloc failed: out of memory. From the above logs, it seems that the available resources on device 3 should be sufficient. ``` 10月 19 22:21:15 gpu ollama[60399]: llama_new_context_with_model: pipeline parallelism enabled (n_copies=4) 10月 19 22:21:15 gpu ollama[60399]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8574.52 MiB on device 3: cudaMalloc failed: out of memory 10月 19 22:21:15 gpu ollama[60399]: ggml_gallocr_reserve_n: failed to allocate CUDA3 buffer of size 8991031296 10月 19 22:21:15 gpu ollama[60399]: llama_new_context_with_model: failed to allocate compute buffers 10月 19 22:21:16 gpu ollama[60399]: llama_init_from_gpt_params: error: failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449' 10月 19 22:21:16 gpu ollama[60399]: ERROR [load_model] unable to load model | model="/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449" tid="140035751489536" timestamp=1729347676 10月 19 22:21:16 gpu ollama[60399]: terminate called without an active exception 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.326+08:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error" 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.515+08:00 level=DEBUG source=server.go:428 msg="llama runner terminated" error="signal: aborted" 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.576+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'" 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:458 msg="triggering expiration for failed load" model=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=sched.go:375 msg="got lock to unload" 
modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449 10月 19 22:21:16 gpu ollama[60399]: time=2024-10-19T22:21:16.577+08:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="125.4 GiB" before.free="109.8 GiB" before.free_swap="3.7 GiB" now.total="125.4 GiB" now.free="109.8 GiB" now.free_swap="3.7 GiB" 10月 19 22:21:16 gpu ollama[60399]: [GIN] 2024/10/19 - 22:21:16 | 500 | 10.016638054s | 172.16.1.219 | POST "/api/generate" ``` ### 4.7 Upon checking the logs through the frontend application, the API interface returned the following error message (this should be due to the errors mentioned above): ``` {"message":"Request failed with status code 500","data":{"error":"llama runner process has terminated: error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-b506a070d1152798d435ec4e7687336567ae653b3106f73b7b4ac7be1cbc4449'"} ``` ## 5. Third Model: llama3.2:latest ### 5.1 At this point, the resources of the four GPU cards have been completely released. By default, OLLAMA_NUM_PARALLEL=4, and the required resources are partial_offload="24.9 GiB" full_offload="24.9 GiB", but none of the individual GPU cards can meet the required resources. As follows, ``` 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.064+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="24.9 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" ``` ### 5.2 The following log shows parallel=1, and the model runs on four GPU cards. 
Question 1: Why does parallel=1 require more resources, 43.7G, memory.required.full="43.7 GiB", while only 24.9G was needed in 5.1? Question 2: Why do models 1 and 3 automatically downgrade from parallel=4 to parallel=1, but the second model does not automatically adjust to parallel=1? ``` 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff library=cuda parallel=1 required="43.7 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.7 GiB" free_swap="3.7 GiB" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.065+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=4 available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" 10月 19 22:21:19 gpu ollama[60399]: time=2024-10-19T22:21:19.066+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split=8,9,8,4 memory.available="[23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="43.7 GiB" memory.required.partial="43.7 GiB" memory.required.kv="12.9 GiB" memory.required.allocations="[11.4 GiB 11.7 GiB 11.4 GiB 9.3 GiB]" memory.weights.total="14.5 GiB" memory.weights.repeating="14.2 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="6.2 GiB" memory.graph.partial="6.2 GiB" ``` ### 5.3 According to the above analysis, the available resources of the four GPU cards (23.3 GiB 23.3 GiB 16.8 GiB 9.7 GiB) can meet the model's resource requirement of 43.7G, why does the third model suddenly report an OOM error. ``` 10月 19 22:21:46 gpu ollama[60399]: CUDA error: out of memory ``` and ``` 10月 19 22:21:47 gpu ollama[60399]: No symbol table is loaded. Use the "file" command. 
10月 19 22:21:47 gpu ollama[60399]: [Inferior 1 (process 60845) detached] 10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.580+08:00 level=DEBUG source=server.go:1044 msg="stopping llama server" 10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.581+08:00 level=DEBUG source=server.go:1050 msg="waiting for llama server to exit" 10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.908+08:00 level=DEBUG source=server.go:428 msg="llama runner terminated" error="signal: aborted" 10月 19 22:21:47 gpu ollama[60399]: time=2024-10-19T22:21:47.908+08:00 level=DEBUG source=server.go:1054 msg="llama server stopped" 10月 19 22:21:47 gpu ollama[60399]: [GIN] 2024/10/19 - 22:21:47 | 500 | 30.138749435s | 172.16.1.219 | POST "/api/generate" ``` ### 5.4 Upon checking the logs through the frontend application, the API interface returned the following error message: ``` {"message":"Request failed with status code 500","data":{"error":"an unknown error was encountered while running the model CUDA error: out of memory\n current device: 3, in function alloc at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:376\n cuMemCreate(&handle, reservesize, &prop, 0)\n/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error"} ``` ### 5.5 The following log is from ollama2.log, and it can be observed that the model eventually runs on a single GPU card and ultimately succeeds. (Different from 5.2 and 5.3 above) ``` 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.192+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.193+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[16.8 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.197+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-6b83f2f6-dc65-7feb-5e02-0cd0087995e8 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="16.8 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.198+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.198+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[9.7 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ad4cba93-ee35-2ea2-dba7-7b5772a098ce library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="9.7 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB" 10月 
20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.199+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[4.3 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:170 msg="gpu has too little memory to allocate any layers" id=GPU-ac079011-c45b-de29-f2e2-71b2e5d2d7f4 library=cuda variant=v12 compute=8.0 driver=12.2 name="NVIDIA A30" total="23.5 GiB" available="4.3 GiB" minimum_memory=479199232 layer_size="1.9 GiB" gpu_zer_overhead="0 B" partial_offload="24.9 GiB" full_offload="23.1 GiB" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:312 msg="insufficient VRAM to load any model layers" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.200+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-1a5993d8-1f60-3ecd-b80f-55ca9f1e95d2 parallel=1 available=24986779648 required="21.6 GiB" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=DEBUG source=sched.go:249 msg="new model fits with existing models, loading" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=INFO source=server.go:103 msg="system memory" total="125.4 GiB" free="109.1 GiB" free_swap="3.7 GiB" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.202+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]" 10月 20 00:14:41 gpu ollama[1265]: time=2024-10-20T00:14:41.203+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.6 GiB" memory.required.partial="21.6 GiB" memory.required.kv="12.9 GiB" memory.required.allocations="[21.6 GiB]" memory.weights.total="14.5 GiB" memory.weights.repeating="14.2 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="5.8 GiB" memory.graph.partial="6.2 GiB" ``` ## 6 I monitor the GPU operation by executing the command ```gpustat -i 1```, as follows: ### 6.1 If the environment variable is set as Environment="OLLAMA_SCHED_SPREAD=1" (as shown in ollama1.log, model 1 runs successfully, while models 2 and 3 fail) All three models run on GPUs 1, 2, and 3 (which is basically as expected), but I do not know why GPU 0 has not been used or is only occasionally occupied. ### 6.2 If the environment variable is not set as Environment="OLLAMA_SCHED_SPREAD=1" (as shown in ollama2.log, all three models run successfully) First model: It runs on GPUs 1, 2, and 3. I do not know why GPU 1 has not been used or is only occasionally occupied, and I am unclear why this model can still run on multiple GPU cards without setting the environment variable Environment="OLLAMA_SCHED_SPREAD=1". Second model: It runs only on GPU 2 (as expected). Third model: It runs only on GPU 3 (as expected).
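For requirement 1 above (releasing GPU memory quickly so the next model does not hit an OOM), one option is to ask the scheduler to drop each model as soon as its request finishes, instead of waiting for the default 5-minute keep-alive. A minimal sketch, assuming the server is on the default port; the model name and prompt are only illustrative:

```
# Ask ollama to unload this model as soon as the response is complete,
# so the next model in the sequence starts against (mostly) freed VRAM.
curl http://localhost:11434/api/generate -d '{
  "model": "glm4:9b",
  "prompt": "Summarize the following text: ...",
  "keep_alive": 0
}'

# An explicit unload can also be requested with no prompt:
curl http://localhost:11434/api/generate -d '{"model": "glm4:9b", "keep_alive": 0}'
```

Even then, the driver can take a short while to report the memory as free again (compare the gap between 4.3 and 4.4), so a brief pause or retry before loading the next model may still be needed.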

@dhiltgen commented on GitHub (Oct 22, 2024):

The reason we don't default to OLLAMA_SCHED_SPREAD=1 is because most users see slower performance due to CPU bottlenecks when a model could load in a single GPU. Have you analyzed the performance to determine that OLLAMA_SCHED_SPREAD=1 does actually increase performance in your setup?
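For reference, one way to check this is to run the same prompt against the same model under both configurations and compare the reported token rates; a rough sketch (model name and prompt are placeholders):

```
# Baseline: spreading disabled (default). The --verbose flag prints timing
# statistics, including the "eval rate" in tokens per second.
ollama run glm4:9b --verbose "Summarize this paragraph: ..."

# Then restart the service with OLLAMA_SCHED_SPREAD=1 set and repeat the
# same prompt; the eval_count/eval_duration fields in the /api/generate
# response can be used for the same comparison when scripting it.
```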


@goactiongo commented on GitHub (Oct 23, 2024):

Because setting OLLAMA_SCHED_SPREAD=1 meant GPU resources were not released in a timely manner, causing other requests to fail for lack of GPU memory for a period of time, I have removed this setting.

Furthermore, following the guidance of @rick-github, I have set the following environment variables: OLLAMA_NUM_PARALLEL=1, OLLAMA_FLASH_ATTENTION=1, and GGML_CUDA_ENABLE_UNIFIED_MEMORY=1.
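
(For completeness: on a systemd-based install, these variables are typically applied the same way the Environment="OLLAMA_SCHED_SPREAD=1" line in section 6 above was, e.g. via a service override. The path below is the conventional drop-in location and may differ on your system.)

```
# /etc/systemd/system/ollama.service.d/override.conf
# (illustrative path, e.g. created with `systemctl edit ollama`)
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
```

Followed by `systemctl daemon-reload` and `systemctl restart ollama` for the change to take effect.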

From @dhiltgen's reply:

> Our current scheduling algorithm does have some difficulty dealing with GPUs that have very different VRAM sizes. I believe that coupled with under-estimating VRAM requirements for large context size is likely leading us to try to put too many layers on the smallest GPU when there's ample room on the larger GPU, which also explains why turning spread on causes this problem to get worse.


@dhiltgen commented on GitHub (Oct 30, 2024):

@SDAIer it sounds like you have a working setup now, is that correct?


@goactiongo commented on GitHub (Oct 31, 2024):

What is your question?


@rick-github commented on GitHub (Oct 31, 2024):

@SDAIer Can this issue be closed?


@goactiongo commented on GitHub (Nov 1, 2024):

OK, I will close this issue. Thanks guys @rick-github @dhiltgen

Reference: github-starred/ollama#4609