[GH-ISSUE #8690] Deepseek-671B: Error: timed out waiting for llama runner to start - progress 0.00 on 8x L40S #31392

Closed
opened 2026-04-22 11:48:10 -05:00 by GiteaMirror · 24 comments
Owner

Originally created by @orlyandico on GitHub (Jan 30, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8690

What is the issue?

Ollama (0.5.7) appears to be correctly calculating how many layers to offload to the GPU with default settings. This is on a g6e.48xlarge which has 1.5TB of RAM.

Jan 30 11:56:19 ip-172-31-21-180 ollama[3237]: time=2025-01-30T11:56:19.283Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=62 layers.offload=51 layers.split=7,7,7,6,6,6,6,6 memory.available="[43.9 GiB 43.9 GiB 43.9 GiB 43.9 GiB 43.9 GiB 43.9 GiB 43.9 GiB 43.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="402.1 GiB" memory.required.partial="330.4 GiB" memory.required.kv="9.5 GiB" memory.required.allocations="[41.4 GiB 41.4 GiB 41.4 GiB 40.9 GiB 41.8 GiB 41.8 GiB 40.9 GiB 40.9 GiB]" memory.weights.total="385.0 GiB" memory.weights.repeating="384.3 GiB" memory.weights.nonrepeating="725.0 MiB" memory.graph.full="1019.5 MiB" memory.graph.partial="1019.5 MiB"
Jan 30 11:56:19 ip-172-31-21-180 ollama[3237]: time=2025-01-30T11:56:19.284Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /usr/share/ollama/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 --ctx-size 2048 --batch-size 512 --n-gpu-layers 51 --threads 96 --parallel 1 --tensor-split 7,7,7,6,6,6,6,6 --port 39933"

...

Jan 30 11:56:20 ip-172-31-21-180 ollama[3237]: llama_load_model_from_file: using device CUDA0 (NVIDIA L40S) - 44940 MiB free
Jan 30 11:56:20 ip-172-31-21-180 ollama[3237]: llama_load_model_from_file: using device CUDA1 (NVIDIA L40S) - 44940 MiB free
Jan 30 11:56:20 ip-172-31-21-180 ollama[3237]: llama_load_model_from_file: using device CUDA2 (NVIDIA L40S) - 44940 MiB free
Jan 30 11:56:20 ip-172-31-21-180 ollama[3237]: llama_load_model_from_file: using device CUDA3 (NVIDIA L40S) - 44940 MiB free
Jan 30 11:56:20 ip-172-31-21-180 ollama[3237]: llama_load_model_from_file: using device CUDA4 (NVIDIA L40S) - 44940 MiB free
Jan 30 11:56:20 ip-172-31-21-180 ollama[3237]: llama_load_model_from_file: using device CUDA5 (NVIDIA L40S) - 44940 MiB free
Jan 30 11:56:20 ip-172-31-21-180 ollama[3237]: llama_load_model_from_file: using device CUDA6 (NVIDIA L40S) - 44940 MiB free
Jan 30 11:56:20 ip-172-31-21-180 ollama[3237]: llama_load_model_from_file: using device CUDA7 (NVIDIA L40S) - 44940 MiB free
Jan 30 11:56:20 ip-172-31-21-180 ollama[3237]: llama_model_loader: loaded meta data with 42 key-value pairs and 1025 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 (version GGUF V3 (latest))

However, I never see GPU VRAM usage climbing (on my 2 x P40 setup it normally climbs as the model loads into VRAM).

It is stuck at this:

Thu Jan 30 12:06:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:9E:00.0 Off |                    0 |
| N/A   40C    P0             81W /  350W |     433MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    On  |   00000000:A0:00.0 Off |                    0 |
| N/A   43C    P0             87W /  350W |     433MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    On  |   00000000:A2:00.0 Off |                    0 |
| N/A   41C    P0             84W /  350W |     433MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L40S                    On  |   00000000:A4:00.0 Off |                    0 |
| N/A   40C    P0             81W /  350W |     433MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA L40S                    On  |   00000000:C6:00.0 Off |                    0 |
| N/A   40C    P0             79W /  350W |     433MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA L40S                    On  |   00000000:C8:00.0 Off |                    0 |
| N/A   40C    P0             80W /  350W |     433MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA L40S                    On  |   00000000:CA:00.0 Off |                    0 |
| N/A   40C    P0             81W /  350W |     433MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA L40S                    On  |   00000000:CC:00.0 Off |                    0 |
| N/A   39C    P0             81W /  350W |     433MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4939      C   ...rs/cuda_v12_avx/ollama_llama_server        424MiB |
|    1   N/A  N/A      4939      C   ...rs/cuda_v12_avx/ollama_llama_server        424MiB |
|    2   N/A  N/A      4939      C   ...rs/cuda_v12_avx/ollama_llama_server        424MiB |
|    3   N/A  N/A      4939      C   ...rs/cuda_v12_avx/ollama_llama_server        424MiB |
|    4   N/A  N/A      4939      C   ...rs/cuda_v12_avx/ollama_llama_server        424MiB |
|    5   N/A  N/A      4939      C   ...rs/cuda_v12_avx/ollama_llama_server        424MiB |
|    6   N/A  N/A      4939      C   ...rs/cuda_v12_avx/ollama_llama_server        424MiB |
|    7   N/A  N/A      4939      C   ...rs/cuda_v12_avx/ollama_llama_server        424MiB |
+-----------------------------------------------------------------------------------------+

and at the very end I get this error:

Jan 30 12:01:19 ip-172-31-21-180 ollama[3237]: time=2025-01-30T12:01:19.487Z level=ERROR source=sched.go:455 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 0.00 - "
Jan 30 12:01:19 ip-172-31-21-180 ollama[3237]: [GIN] 2025/01/30 - 12:01:19 | 500 |          5m4s |       127.0.0.1 | POST     "/api/generate"
Jan 30 12:01:26 ip-172-31-21-180 ollama[3237]: time=2025-01-30T12:01:26.104Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=6.61651503 model=/usr/share/ollama/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9
Jan 30 12:01:28 ip-172-31-21-180 ollama[3237]: time=2025-01-30T12:01:28.080Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=8.592545492 model=/usr/share/ollama/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9
Jan 30 12:01:30 ip-172-31-21-180 ollama[3237]: time=2025-01-30T12:01:30.058Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=10.570809357 model=/usr/share/ollama/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9


OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.5.7

GiteaMirror added the bug label 2026-04-22 11:48:10 -05:00
Author
Owner

@orlyandico commented on GitHub (Jan 30, 2025):

Also: I tried Nemotron and it correctly loaded (entirely) into a single GPU and did inference properly.

Author
Owner

@rick-github commented on GitHub (Jan 30, 2025):

OLLAMA_LOAD_TIMEOUT=30m
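
For a systemd-managed install, one way to apply this is a drop-in override for the service unit (a minimal sketch, assuming the stock ollama.service unit; the file path is illustrative):

```
# /etc/systemd/system/ollama.service.d/override.conf  (or create it with: sudo systemctl edit ollama)
[Service]
Environment="OLLAMA_LOAD_TIMEOUT=30m"
```

Then `sudo systemctl daemon-reload && sudo systemctl restart ollama` to pick it up.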

Author
Owner

@orlyandico commented on GitHub (Jan 30, 2025):

Timed out after 30 minutes. I increased the env variable to 90 minutes; it's still crunching after 40 minutes, and the VRAM footprint is still stuck at 400-odd MB.

Author
Owner

@rick-github commented on GitHub (Jan 30, 2025):

Is the model on local or network storage? Does iostat 5 or vmstat 5 show blocks being read?

Author
Owner

@orlyandico commented on GitHub (Jan 30, 2025):

The model is sitting on a 16000 IOPS GP3 volume. It pulled and installed properly with "ollama run deepseek-r1:671b".

Given that the smaller Nemotron model ran properly, it appears the stack works fine.

ubuntu@ip-172-31-21-180:~$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  1      0 1508580736 312828 45931524    0    0    23    31    4    3  0  0 100  0  0
 1  1      0 1508552192 312828 45962820    0    0  6246     0  490  877  0  0 99  1  0
 0  1      0 1508524032 312828 45994116    0    0  6246     0  461  821  0  0 99  1  0
 0  1      0 1508491264 312828 46026980    0    0  6554     0  494  902  0  0 99  1  0
 0  1      0 1508459776 312828 46059300    0    0  6451     0  444  823  0  0 99  1  0
 0  1      0 1508428288 312828 46091620    0    0  6451     0  430  797  0  0 99  1  0
 0  1      0 1508396032 312828 46124452    0    0  6554     0  431  809  0  0 99  1  0
 0  1      0 1508364416 312828 46155780    0    0  6246     0  561  984  0  0 99  1  0
 0  1      0 1508332928 312828 46187588    0    0  6349     0  451  830  0  0 99  1  0
 0  1      0 1508297600 312828 46220420    0    0  6554     0  454  827  0  0 99  1  0
 0  1      0 1508264832 312828 46253796    0    0  6656     0  461  830  0  0 99  1  0
 0  1      0 1508232832 312828 46287140    0    0  6656     0  484  885  0  0 99  1  0
 0  1      0 1508187008 312828 46335876    0    0  9728     0  530  831  0  0 99  1  0
 0  1      0 1508122240 312828 46398500    0    0 12493     0  471  862  0  0 99  1  0
 0  1      0 1508065664 312828 46459044    0    0 12083     0  467  857  0  0 99  1  0
 0  1      0 1508025344 312828 46500100    0    0  8192     0  498  923  0  0 99  1  0
 0  1      0 1507993088 312828 46533476    0    0  6656     0  481  857  0  0 99  1  0
Author
Owner

@orlyandico commented on GitHub (Jan 30, 2025):

strace from the ollama PID

ubuntu@ip-172-31-21-180:~$ sudo strace -p 4813
strace: Process 4813 attached
futex(0x58c04f8b4300, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc000031648, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 4813
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=4813, si_uid=997} ---
rt_sigreturn({mask=[]})                 = 4813
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=4813, si_uid=997} ---
rt_sigreturn({mask=[]})                 = 4813
tgkill(4813, 4862, SIGURG)              = 0
futex(0xc000280148, FUTEX_WAKE_PRIVATE, 1) = 1
sched_yield()                           = 0
futex(0x58c04f8b53d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x58c04f8b52d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x58c04f8b53d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x58c04f8b52d8, FUTEX_WAKE_PRIVATE, 1) = 1
sched_yield()                           = 0
futex(0x58c04f8b53b0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x58c04f8b4300, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
sched_yield()                           = 0
futex(0x58c04f8b52d8, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x58c04f8b53d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x58c04f8b52d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x58c04f8b4300, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc000033948, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000033948, FUTEX_WAKE_PRIVATE, 1) = 1
read(178, 0xc0002c7000, 4096)           = -1 EAGAIN (Resource temporarily unavailable)
futex(0x58c04f8b4300, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(160, [], 128, 0, NULL, 0)   = 0
futex(0x58c04f8b4300, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(160, [], 128, 0, NULL, 0)   = 0
futex(0x58c04f8b4300, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(160, [], 128, 0, NULL, 0)   = 0
epoll_pwait(160, [{events=EPOLLIN|EPOLLOUT, data={u32=4118282241, u64=9066244785617502209}}], 128, -1, NULL, 0) = 1
futex(0x58c04f8b53c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x58c04f8b52d8, FUTEX_WAKE_PRIVATE, 1) = 1
read(178, "GET /health HTTP/1.1\r\nHost: 127."..., 4096) = 134
futex(0xc000280148, FUTEX_WAKE_PRIVATE, 1) = 1
write(178, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 148) = 148
futex(0xc000280148, FUTEX_WAKE_PRIVATE, 1) = 1
read(178, 0xc0002c7000, 4096)           = -1 EAGAIN (Resource temporarily unavailable)
futex(0x58c04f8b4300, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(160, [], 128, 0, NULL, 0)   = 0
futex(0xc000280148, FUTEX_WAKE_PRIVATE, 1) = 1
read(178, 0xc0002c7000, 4096)           = -1 EAGAIN (Resource temporarily unavailable)
Author
Owner

@rick-github commented on GitHub (Jan 30, 2025):

bi is non-zero, so stuff is happening. The strace is on the server; the runner (ollama_llama_server) is the one doing the work of loading the model into the GPUs.

Author
Owner

@orlyandico commented on GitHub (Jan 30, 2025):

qwen-110b runs fine: 10 GB on each of the GPUs.

Author
Owner

@orlyandico commented on GitHub (Jan 30, 2025):

   5228 pts/0    Sl+    0:00 ollama run deepseek-r1:671b SHELL=/bin/bash PWD=/home/ubuntu LOGNAME=ubuntu XDG_SESSION_TYPE=t
   5294 ?        Sl     0:05 /usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /usr/share/olla

I did the strace on the PID that shows up in nvidia-smi (this is a different run as I interrupted the deepseek run to test qwen-110b)

strace on the above PID also shows the futex and epoll_pwait calls.

I don't see any other runners.
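
One hedged way to double-check that the runner is actually reading the blob is to attach strace to the runner PID and count the read syscalls (the pgrep lookup is just one option; interrupt with Ctrl-C to see the summary):

```
# find the runner process (not the ollama server or the CLI client)
RUNNER_PID=$(pgrep -f ollama_llama_server)
# -f follows threads, -c prints a syscall count/time summary on exit;
# read/pread64 should dominate while the model file is being loaded
sudo strace -f -p "$RUNNER_PID" -e trace=read,pread64 -c
```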

Author
Owner

@rick-github commented on GitHub (Jan 30, 2025):

Yep, you are right, I saw the HTTP in the strace and just jumped to thinking it was the server, forgetting that the server polls the runner for health checks.

Author
Owner

@rick-github commented on GitHub (Jan 30, 2025):

What's sudo ls -l /proc/4813/fd show?

Author
Owner

@orlyandico commented on GitHub (Jan 30, 2025):

ubuntu@ip-172-31-21-180:~$ sudo ls -l /proc/5294/fd
total 0
lr-x------ 1 ollama ollama 64 Jan 30 16:57 0 -> /dev/null
lrwx------ 1 ollama ollama 64 Jan 30 16:57 1 -> 'socket:[201741]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 10 -> /dev/nvidia-uvm
lrwx------ 1 ollama ollama 64 Jan 30 16:57 100 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 101 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 102 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 103 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 104 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 105 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 106 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 107 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 108 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 109 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 11 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 110 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 111 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 112 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 113 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 114 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 115 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 116 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 117 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 118 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 119 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 12 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 120 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 121 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 122 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 123 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 124 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 125 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 126 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 127 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 128 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 129 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 13 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 130 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 131 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 132 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 133 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 134 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 135 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 136 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 137 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 138 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 139 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 14 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 140 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 141 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 142 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 143 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 144 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 145 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 146 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 147 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 148 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 149 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 15 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 150 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 151 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 152 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 153 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 154 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 155 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 156 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 157 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 158 -> 'socket:[31812]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 159 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 16 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 160 -> 'anon_inode:[eventpoll]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 161 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 162 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 163 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 164 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 165 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 166 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 167 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 168 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 169 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 17 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 170 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 171 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 172 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 173 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 174 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 175 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 176 -> /dev/nvidia0
lr-x------ 1 ollama ollama 64 Jan 30 16:57 177 -> /usr/share/ollama/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9
lrwx------ 1 ollama ollama 64 Jan 30 16:57 18 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 180 -> 'socket:[31818]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 19 -> /dev/nvidia0
l-wx------ 1 ollama ollama 64 Jan 30 16:57 2 -> 'pipe:[207878]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 20 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 21 -> /dev/nvidia0
lrwx------ 1 ollama ollama 64 Jan 30 16:57 22 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 23 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 24 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 25 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 26 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 27 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 28 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 29 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 3 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 30 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 31 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 32 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 33 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 34 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 35 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 36 -> /dev/nvidia5
lrwx------ 1 ollama ollama 64 Jan 30 16:57 37 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 38 -> /dev/nvidia6
lrwx------ 1 ollama ollama 64 Jan 30 16:57 39 -> /dev/nvidia6
lr-x------ 1 ollama ollama 64 Jan 30 16:57 4 -> 'pipe:[31785]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 40 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 41 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 42 -> /dev/nvidia7
lrwx------ 1 ollama ollama 64 Jan 30 16:57 43 -> 'socket:[31796]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 44 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 45 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 46 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 47 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 48 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 49 -> /dev/nvidia1
l-wx------ 1 ollama ollama 64 Jan 30 16:57 5 -> 'pipe:[31785]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 50 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 51 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 52 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 53 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 54 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 55 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 56 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 57 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 58 -> /dev/nvidia1
lrwx------ 1 ollama ollama 64 Jan 30 16:57 59 -> /dev/nvidia1
lr-x------ 1 ollama ollama 64 Jan 30 16:57 6 -> 'pipe:[31786]'
lr-x------ 1 ollama ollama 64 Jan 30 16:57 60 -> 'pipe:[31799]'
l-wx------ 1 ollama ollama 64 Jan 30 16:57 61 -> 'pipe:[31799]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 62 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 63 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 64 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 65 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 66 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 67 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 68 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 69 -> 'anon_inode:[eventfd]'
l-wx------ 1 ollama ollama 64 Jan 30 16:57 7 -> 'pipe:[31786]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 70 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 71 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 72 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 73 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 74 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 75 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 76 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 77 -> /dev/nvidia2
lrwx------ 1 ollama ollama 64 Jan 30 16:57 78 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 79 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 8 -> /dev/nvidiactl
lrwx------ 1 ollama ollama 64 Jan 30 16:57 80 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 81 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 82 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 83 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 84 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 85 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 86 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 87 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 88 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 89 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 9 -> /dev/nvidia-uvm
lrwx------ 1 ollama ollama 64 Jan 30 16:57 90 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 91 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 92 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 93 -> /dev/nvidia3
lrwx------ 1 ollama ollama 64 Jan 30 16:57 94 -> 'anon_inode:[eventfd]'
lrwx------ 1 ollama ollama 64 Jan 30 16:57 95 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 96 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 97 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 98 -> /dev/nvidia4
lrwx------ 1 ollama ollama 64 Jan 30 16:57 99 -> /dev/nvidia4
Author
Owner

@orlyandico commented on GitHub (Jan 30, 2025):

The key issue is that smaller models like Nemotron (70B) and Qwen (110B) work fine and as expected...

Author
Owner

@rick-github commented on GitHub (Jan 30, 2025):

Sure. And the runner has the devices and the file open, and blocks are being read, so it's doing stuff. But it's only doing 1250 KiB/s in block reads, so that's ((8 * 43) * 1024 * 1024) / 1250 or 288568 seconds of reading, or 80 hours to read the model off storage.
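
The arithmetic behind that figure, for reference (shell; 8 × 43 GiB expressed in KiB, divided by the quoted 1250 KiB/s):

```
echo $(( 8 * 43 * 1024 * 1024 / 1250 ))   # 288568 seconds, roughly 80 hours
```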

Author
Owner

@orlyandico commented on GitHub (Jan 30, 2025):

That is odd, because the 62GB blob for qwen:110b loads fairly quickly (sub 5 minutes)...

Author
Owner

@rick-github commented on GitHub (Jan 30, 2025):

I don't know anything about GP3 volumes; is 16000 IOPS a peak or a fully provisioned metric? Have you used up the quota for whatever the billing chunk is? If it's not being throttled by the service provider, then something is causing the runner to go slow. I can't think, off the top of my head, of anything in the code that would cause that.

Author
Owner

@orlyandico commented on GitHub (Jan 30, 2025):

Interesting development: I zapped the entire model and re-downloaded it. Now I am getting the error from https://github.com/ollama/ollama/issues/8597

I modified the startup config

[Service]
Environment="OLLAMA_LOAD_TIMEOUT=90m GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 OLLAMA_GPU_OVERHEAD=536870912 OLLAMA_FLASH_ATTENTION=1"

But continue to get the error.

Jan 30 17:23:05 ip-172-31-21-180 ollama[5826]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 45108.64 MiB on device 0: cudaMalloc failed: out of memory
Jan 30 17:23:18 ip-172-31-21-180 ollama[5826]: llama_model_load: error loading model: unable to allocate CUDA0 buffer
Jan 30 17:23:18 ip-172-31-21-180 ollama[5826]: llama_load_model_from_file: failed to load model
Jan 30 17:23:18 ip-172-31-21-180 ollama[5826]: panic: unable to load model: /usr/share/ollama/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9
Jan 30 17:23:18 ip-172-31-21-180 ollama[5826]: goroutine 34 [running]:
Jan 30 17:23:18 ip-172-31-21-180 ollama[5826]: github.com/ollama/ollama/llama/runner.(*Server).loadModel(0xc000332000, {0x33, 0x0, 0x1, 0x0, {0xc000358000, 0x8, 0x8}, 0xc0003160d0, 0x0}, ...)
Jan 30 17:23:18 ip-172-31-21-180 ollama[5826]:         github.com/ollama/ollama/llama/runner/runner.go:852 +0x3ad
Jan 30 17:23:18 ip-172-31-21-180 ollama[5826]: created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
Jan 30 17:23:18 ip-172-31-21-180 ollama[5826]:         github.com/ollama/ollama/llama/runner/runner.go:970 +0xd0d
Jan 30 17:23:18 ip-172-31-21-180 ollama[5826]: time=2025-01-30T17:23:18.508Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
Jan 30 17:23:19 ip-172-31-21-180 ollama[5826]: time=2025-01-30T17:23:19.510Z level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer"
Jan 30 17:23:19 ip-172-31-21-180 ollama[5826]: [GIN] 2025/01/30 - 17:23:19 | 500 | 50.743235189s |       127.0.0.1 | POST     "/api/generate"
Jan 30 17:23:25 ip-172-31-21-180 ollama[5826]: time=2025-01-30T17:23:25.738Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=6.227852542 model=/usr/share/ollama/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9
Jan 30 17:23:27 ip-172-31-21-180 ollama[5826]: time=2025-01-30T17:23:27.739Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=8.228527312 model=/usr/share/ollama/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9
Jan 30 17:23:29 ip-172-31-21-180 ollama[5826]: time=2025-01-30T17:23:29.738Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=10.227048513 model=/usr/share/ollama/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9

@rick-github commented on GitHub (Jan 30, 2025):

[Service]
Environment="OLLAMA_LOAD_TIMEOUT=90m"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
Environment="OLLAMA_GPU_OVERHEAD=536870912"
Environment="OLLAMA_FLASH_ATTENTION=1"
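Each variable needs its own Environment= line, as shown above. A sketch of applying this as a systemd drop-in, assuming the default ollama.service unit name:

```bash
# Open an override (drop-in) file for the service and paste the [Service] block above.
sudo systemctl edit ollama.service

# Reload unit files and restart so the new environment takes effect.
sudo systemctl daemon-reload
sudo systemctl restart ollama.service

# Confirm the variables are visible to the running service.
sudo systemctl show ollama.service --property=Environment
```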

@orlyandico commented on GitHub (Jan 30, 2025):

Got it working. It runs at 4-5 tokens/second, and GPU utilisation is very low. (I changed num_ctx to 12288, so the KV cache was about 5 GB on each GPU.)

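For reference, num_ctx can also be set per request through the Ollama API instead of a Modelfile. A minimal sketch against a local server; the model name is a placeholder for whatever tag the 671B model is loaded under:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-671b",
  "prompt": "Hello",
  "options": { "num_ctx": 12288 },
  "stream": false
}'
```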

@rick-github commented on GitHub (Jan 30, 2025):

LLMs are stacks of layers, and each layer has to be computed before the next, so when the layers are split across GPUs the average utilisation per GPU during a completion will be about 1/gpu_devices = 12.5% in your case.

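One way to see this effect is to sample per-GPU utilisation once a second while a completion is running; a minimal sketch using nvidia-smi:

```bash
# Print utilisation and memory for each GPU every second. With layers split
# across 8 GPUs and executed sequentially, each GPU should average roughly 1/8
# utilisation over a completion.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1
```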

@orlyandico commented on GitHub (Jan 30, 2025):

GPU utilisation was much less than 12.5%, because some of the layers are on the CPU. CPU load was off the charts (top could not even display it properly) due to the VM having 192 threads / 96 cores.

Added these to the Modelfile:

PARAMETER num_gpu 44
PARAMETER num_ctx 12288

GPU VRAM utilisation was about 40GB per GPU including the KV cache. I guess it could be squeezed a bit more to get more layers onto the GPU.

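For anyone reproducing this, a sketch of baking those parameters into a derived model; the FROM tag and the new model name are placeholders for whatever tag the 671B quant was actually pulled or imported as:

```bash
# Write a Modelfile that layers the parameters on top of the existing local model.
cat > Modelfile <<'EOF'
# Placeholder base tag; replace with the local tag of the 671B Q4 model.
FROM deepseek-r1:671b
PARAMETER num_gpu 44
PARAMETER num_ctx 12288
EOF

# Create and run the derived model.
ollama create deepseek-671b-gpu44 -f Modelfile
ollama run deepseek-671b-gpu44
```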

@gallery2016 commented on GitHub (Feb 6, 2025):

@orlyandico Hello, I am in a similar situation to yours.
My total GPU memory is 48 GB × 8 = 384 GB. If I use Ollama to run the Q4 quant of the 671B model there is not enough GPU memory, so I set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1; it then uses the server's system memory and everything is fine at startup.

However, after a period of conversation I found that the server's free memory keeps decreasing until it runs out, and by the last conversation it is very laggy.

Have you encountered a similar situation? Thank you!

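A simple way to confirm whether host memory really is being consumed over the course of a conversation is to watch free system memory alongside VRAM while chatting; a minimal sketch:

```bash
# Host memory every 5 seconds; if unified memory is paging weights or KV cache
# into RAM, "available" should shrink as the conversation grows.
watch -n 5 free -h

# VRAM per GPU for comparison, sampled every 5 seconds.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5
```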

@orlyandico commented on GitHub (Feb 6, 2025):

I gave up on using the L40S. Having layers on the CPU slowed things down so much that running on pure CPU wasn't much slower (4 tok/second).


@gallery2016 commented on GitHub (Feb 6, 2025):

@orlyandico Thank you for the advice.

Reference: github-starred/ollama#31392