[GH-ISSUE #1500] GPU MIG not supported in Kubernetes #47323

Closed
opened 2026-04-28 03:35:25 -05:00 by GiteaMirror · 17 comments

Originally created by @duhow on GitHub (Dec 13, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1500

Originally assigned to: @dhiltgen on GitHub.

https://github.com/jmorganca/ollama/blob/7db5bcf73bf7026970e988f56126db8f370f1b11/llm/llama.go#L238

Getting the GPU information (full-GPU memory) is not possible here: the call above returns `Insufficient Permissions`, because the container is assigned only a part of the GPU via MIG (Multi-Instance GPU).

However, the container can actually see the MIG devices, and `ollama` should be able to use them.

```
root@ollama-0:/# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:05:00.0 Off |                   On |
| N/A   35C    P0    43W / 300W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
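
For reference, a minimal sketch of what MIG-aware discovery might look like with NVIDIA's go-nvml bindings: if the parent device's memory query fails (as it does here), enumerate the MIG instances the container can see instead. The fallback policy is an illustrative assumption, not ollama's actual detection code:

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("nvml init failed: %v", ret)
	}
	defer nvml.Shutdown()

	count, _ := nvml.DeviceGetCount()
	for i := 0; i < count; i++ {
		dev, _ := nvml.DeviceGetHandleByIndex(i)

		// On a MIG-partitioned GPU inside a container, the parent
		// device's memory query fails with NVML_ERROR_NO_PERMISSION.
		if mem, ret := dev.GetMemoryInfo(); ret == nvml.SUCCESS {
			fmt.Printf("GPU %d: %d MiB total\n", i, mem.Total/1024/1024)
			continue
		}

		// Fall back to enumerating the visible MIG instances.
		current, _, ret := dev.GetMigMode()
		if ret != nvml.SUCCESS || current != nvml.DEVICE_MIG_ENABLE {
			continue
		}
		maxMig, _ := dev.GetMaxMigDeviceCount()
		for m := 0; m < maxMig; m++ {
			mig, ret := dev.GetMigDeviceHandleByIndex(m)
			if ret != nvml.SUCCESS {
				continue
			}
			if mem, ret := mig.GetMemoryInfo(); ret == nvml.SUCCESS {
				fmt.Printf("GPU %d MIG %d: %d MiB total\n", i, m, mem.Total/1024/1024)
			}
		}
	}
}
```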
GiteaMirror added the nvidia and feature request labels 2026-04-28 03:35:26 -05:00

@duhow commented on GitHub (Jan 18, 2024):

Still not working in v0.1.20.

```
2024/01/18 10:33:28 routes.go:930: Listening on [::]:11434 (version 0.1.20)
2024/01/18 10:33:29 shim_ext_server.go:142: Dynamic LLM variants [cuda]
2024/01/18 10:33:29 gpu.go:88: Detecting GPU type
2024/01/18 10:33:29 gpu.go:203: Searching for GPU management library libnvidia-ml.so
2024/01/18 10:33:29 gpu.go:248: Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.108.03]
2024/01/18 10:33:29 gpu.go:94: Nvidia GPU detected
2024/01/18 10:33:29 gpu.go:125: error looking up CUDA GPU memory: device memory info lookup failure 0: 4
2024/01/18 10:33:29 routes.go:953: no GPU detected
```

(NVML return code 4 is `NVML_ERROR_NO_PERMISSION`, consistent with the MIG permission restriction described above.)

@dhiltgen commented on GitHub (Jan 27, 2024):

If I'm understanding correctly, in this environment we will not be able to use the management library to discover the available GPU memory. That's unfortunate, given that we really need to know that information before we try to load the model; otherwise we may over- (or under-) allocate VRAM and, in the over-allocation case, crash.

Similar to #1979 we might be able to refine the GPU discovery algorithm to allow you to specify how much memory we can use via an env var override, and then force the CUDA library to be used even though we couldn't perform the management library calls.
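
Something along the lines of this sketch could work. The variable name `OLLAMA_VRAM_OVERRIDE` is purely hypothetical, chosen for illustration, not a real ollama setting:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// usableVRAM implements the override idea: if the NVML memory lookup
// fails (as it does under MIG), let the user declare usable VRAM via
// an env var instead of disabling GPU support entirely.
func usableVRAM(nvmlBytes uint64, nvmlErr error) (uint64, error) {
	if nvmlErr == nil {
		return nvmlBytes, nil // normal path: trust NVML
	}
	if s := os.Getenv("OLLAMA_VRAM_OVERRIDE"); s != "" {
		mib, err := strconv.ParseUint(s, 10, 64)
		if err != nil {
			return 0, fmt.Errorf("bad OLLAMA_VRAM_OVERRIDE %q: %w", s, err)
		}
		return mib << 20, nil // value interpreted as MiB
	}
	return 0, nvmlErr // no override: surface the original failure
}

func main() {
	// Simulate the MIG case: NVML lookup failed with code 4 (NO_PERMISSION).
	vram, err := usableVRAM(0, errors.New("device memory info lookup failure 0: 4"))
	fmt.Println(vram, err)
}
```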


@waTeim commented on GitHub (Feb 6, 2024):

Say, maybe check out my PR: what testing beyond what I've done (if any) is needed?


@Defiler226 commented on GitHub (Feb 19, 2024):

Great work, this solved my issue with MIG on Kubernetes! Hope we can get this into the main branch.


@northcode7 commented on GitHub (Feb 26, 2024):

Solved the issue with MIG here as well. Works great on K8s. Please merge to main.


@dhiltgen commented on GitHub (Apr 15, 2024):

Once #3418 merges, we'll be relying solely on the cudart library (no more management library), so that will help move us toward resolving this feature request.


@dhiltgen commented on GitHub (May 21, 2024):

I'm curious whether the recent transition over to the Driver API has had any impact on MIG support. Could people who have MIG configurations try out the latest ollama image builds and report back whether they work properly, or whether we still need a rebased/refined version of #2264 (or equivalent) merged to enable this use case?


@dasantonym commented on GitHub (May 24, 2024):

Hey @dhiltgen, I can confirm MIG is now working for us with the latest image and the GPU is detected. Thanks a lot!


@dhiltgen commented on GitHub (May 25, 2024):

That's great to hear!


@jonasmock commented on GitHub (May 31, 2024):

@dasantonym I have a question regarding MIG. We have 2 x A100 80GB and plan to use the single MIG strategy with 14 x 10GB slices in our Kubernetes cluster.

I want to assign 2 of the slices to the Ollama pod. Is Ollama able to use both of the slices, or will it just use one of them?


@mhoehl05 commented on GitHub (Aug 1, 2024):

> @dasantonym I have a question regarding MIG. We have 2 x A100 80GB and plan to use the single MIG strategy with 14 x 10GB slices in our Kubernetes cluster.
>
> I want to assign 2 of the slices to the Ollama pod. Is Ollama able to use both of the slices, or will it just use one of them?

I've been testing the same setup with 1x H100 and 20GB slices for a proof of concept and am running into the same issue. Ollama only utilizes 1 of 3 passed MIGs:

![image](https://github.com/user-attachments/assets/b96b3364-97a2-4e8f-b6ae-773c68b714ac)


@dasantonym commented on GitHub (Aug 1, 2024):

Hey @mhoehl05 and @jonasmock, unfortunately I have no clue about this. Is this even supposed to be supported by the architecture? My instinct would be to just run multiple instances with one GPU (or slice) each, then load-balance between them.

🤷


@mhoehl05 commented on GitHub (Aug 1, 2024):

@dasantonym Using Kubernetes, you will not be able to run multiple ollama instances on one GPU, since you need to pass the GPU into the container, making it available only to that one container.

You can load multiple models on one ollama instance, but that kind of kills the purpose of Kubernetes. Slicing the GPU into MIGs via the mig-manager used by the nvidia-operator in Kubernetes would be a better solution: create dedicated ollama instances for each model and pass MIG slices of the GPU via your configured MIG strategy (you might even want to use the ollama-operator for that). That way you can precisely orchestrate how many resources are available to each model.

Of course, you can skip Kubernetes entirely and deploy a beefy VM that can hold many models. But that might cause trouble loading models: each model tries to reserve part of the video memory, and if you run out of resources, models keep getting loaded and unloaded, which hurts performance.

I think it's important to have control over your GPU's resources, and with Kubernetes you could dynamically allocate GPU resources to your models. But MIGs need to be supported by ollama for that to work.
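
For reference, requesting a MIG slice for a dedicated per-model instance looks roughly like this under the NVIDIA device plugin's mixed strategy. The pod name, image tag, and profile size below are examples and depend on how your cluster's GPUs are sliced:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ollama-llama3
spec:
  containers:
    - name: ollama
      image: ollama/ollama
      resources:
        limits:
          # The mixed strategy exposes one resource per MIG profile;
          # with the single strategy this would be nvidia.com/gpu instead.
          nvidia.com/mig-2g.20gb: 1
```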


@dasantonym commented on GitHub (Aug 1, 2024):

Sorry, I guess this was badly phrased. I meant 1 Pod -> 1 Model / Instance -> 1 exclusive GPU or slice, then scale that up and load-balance between the Pods. MIG is working, so you should be fine. We're using it like this, at least it works for us.


@mhoehl05 commented on GitHub (Aug 2, 2024):

Oh yes, of course that's a valid solution. But since we have models that require different VRAM capacities, we would need to use the mixed strategy and slice up according to the models we use. That raises the problem that each model needs a "MIG tier" that can be used to scale. For instance, we might have 20GB slices and 50GB slices across the cluster, and a model like llama3.1:70b could only utilize the 50GB slices (or rather, only perform decently on those slices).

Also, since a GPU can only be sliced while not in use, you'd need to find a strategy that suits you best; otherwise you would have to drain a node and re-slice its GPU(s).

I think a more suitable approach would be to create multiple smaller slices that can be consumed by larger models, but are sufficient to run smaller models.


@waTeim commented on GitHub (Aug 2, 2024):

Check me if I'm wrong on this, but you can't just add up the VRAM (e.g. use 10 x 5G instances); each MIG instance must have at least as much VRAM as the layers to be loaded onto it require -- at least judging from the limited debugging messages I saw.
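
For intuition, here's a minimal sketch of that per-slice fit constraint. The greedy first-fit placement and all sizes are illustrative assumptions, not ollama's actual scheduler:

```go
package main

import "fmt"

// placeLayers greedily assigns fixed-size layers to MIG slices.
// A layer is never split across slices, so each slice must hold at
// least one whole layer -- adding more small slices does not help
// once a layer exceeds the per-slice capacity.
func placeLayers(layerSizeMiB, numLayers int, sliceCapMiB []int) (placed int) {
	free := append([]int(nil), sliceCapMiB...)
	for l := 0; l < numLayers; l++ {
		for i := range free {
			if free[i] >= layerSizeMiB {
				free[i] -= layerSizeMiB
				placed++
				break
			}
		}
	}
	return placed
}

func main() {
	tenSlices := []int{5120, 5120, 5120, 5120, 5120, 5120, 5120, 5120, 5120, 5120}
	// Hypothetical model: 80 layers at 500 MiB each (40 GiB total)
	// fits across ten 5 GiB slices, since each layer fits a slice.
	fmt.Println(placeLayers(500, 80, tenSlices)) // 80: all placed
	// But a single 6 GiB buffer cannot be placed on any 5 GiB slice,
	// no matter how many slices there are.
	fmt.Println(placeLayers(6144, 1, tenSlices)) // 0: does not fit
}
```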


@mhoehl05 commented on GitHub (Aug 5, 2024):

From my test you can see that I have been able to run llama3.1:70b on a single 20G MIG instance. Ollama estimates a requirement of 40GB of VRAM for the model (https://ollama.com/library/llama3.1:70b). When testing the model, it was really slow. I am not sure what the mechanism for running larger models with less VRAM is. The ollama home page tells us to use different tags for different quantizations, so I guess that's not it, but please correct me if I am wrong.

From what I've seen in previous tests, ollama does utilize multiple GPUs if needed. I am hoping to achieve the same behavior with MIGs.
