[GH-ISSUE #1500] GPU MIG not supported in Kubernetes #811

@waTeim commented on GitHub (Feb 6, 2024):

Say, maybe check out my PR. What testing, beyond what I've done (if any), is needed?

@Defiler226 commented on GitHub (Feb 19, 2024):

Great work, solved my issue with MIG on Kubernetes! Hope we can get this to the main branch.

@northcode7 commented on GitHub (Feb 26, 2024):

Solved the issue with MIG here as well. Works great on K8s. Please merge to main.

@dhiltgen commented on GitHub (Apr 15, 2024):

Once #3418 merges, we'll be relying solely on the cudart library (no more management library) so that will help move us forward towards resolving this feature request.

@dhiltgen commented on GitHub (May 21, 2024):

I'm curious whether the recent transition over to the Driver API has had any impact on MIG support. Could people who have MIG configurations try out the latest ollama image builds and report back whether they work properly, or whether we still need to get a rebased/refined version of #2264 (or equivalent) merged to enable this use case?

@dasantonym commented on GitHub (May 24, 2024):

Hey @dhiltgen, I can confirm MIG is now working for us with the latest image and the GPU is detected. Thanks a lot!

@dhiltgen commented on GitHub (May 25, 2024):

That's great to hear!

@jonasmock commented on GitHub (May 31, 2024):

@dasantonym I have a question regarding MIG. We have 2 x A100 80GB and plan to use the single MIG strategy with 14 x 10GB slices in our Kubernetes cluster.

I want to assign 2 of the slices to the Ollama pod. Is Ollama able to use both of the slices? Or will it just use one of them?

@mhoehl05 commented on GitHub (Aug 1, 2024):

> @dasantonym I have a question regarding MIG. We have 2 x A100 80GB and plan to use the single MIG strategy with 14 x 10GB slices in our Kubernetes cluster.
>
> I want to assign 2 of the slices to the Ollama pod. Is Ollama able to use both of the slices? Or will it just use one of them?

I've been testing the same setup with 1x H100 and 20GB slices for a proof of concept, and I'm running into the same issue. Ollama only utilizes 1 of the 3 MIGs passed to it:

![image](https://github.com/user-attachments/assets/b96b3364-97a2-4e8f-b6ae-773c68b714ac)

@dasantonym commented on GitHub (Aug 1, 2024):

Hey @mhoehl05 and @jonasmock, unfortunately I have no clue about this. Is this even supposed to be supported by the architecture? My instinct would be to just run multiple instances on one 100 each, then load-balance between them.

🤷
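A minimal sketch of the "several instances, then load-balance" idea, assuming one Ollama instance per GPU slice and a plain round-robin policy. The endpoint URLs and the `RoundRobinBalancer` class are hypothetical names for illustration only; nothing here is part of Ollama itself, and in Kubernetes a Service in front of the pods would normally do this job.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backend endpoints in a fixed rotating order."""

    def __init__(self, endpoints):
        endpoints = list(endpoints)
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        # cycle() repeats the list forever, giving round-robin order.
        self._cycle = cycle(endpoints)

    def next_endpoint(self):
        """Return the endpoint that should receive the next request."""
        return next(self._cycle)


if __name__ == "__main__":
    # Hypothetical in-cluster addresses, one Ollama instance per slice.
    balancer = RoundRobinBalancer([
        "http://ollama-0:11434",
        "http://ollama-1:11434",
    ])
    for _ in range(4):
        print(balancer.next_endpoint())
```

Round-robin ignores how busy each instance is, so it only makes sense when every instance serves the same model on an equally sized slice.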

@mhoehl05 commented on GitHub (Aug 1, 2024):

@dasantonym using Kubernetes you will not be able to run multiple ollama instances on one GPU, since you need to pass the GPU into the container, making it available only to that one container.

You can load multiple models on one ollama instance, but that kind of defeats the purpose of Kubernetes. Slicing the GPU into MIGs via the mig-manager used by the nvidia-operator in Kubernetes would be a better solution: create a dedicated ollama instance for each model and pass it MIGs of the GPU via your configured MIG strategy (you might even want to use the ollama-operator for that). That way you can precisely orchestrate how many resources are available for each model.

Of course you can just not use Kubernetes at all and deploy a beefy VM that can hold many models. But that might cause trouble loading models, since each model tries to reserve part of the video memory, and if you do run out of resources, models keep getting loaded and unloaded, which impacts performance.

I think it's important to have control over the resources that your GPU has, and using Kubernetes you could dynamically allocate GPU resources to your models. Yet MIGs need to be supported by ollama in order for that to function.
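As a rough illustration of the one-instance-per-model setup described above, here is a sketch that builds a Deployment manifest requesting MIG slices as a container resource limit. The helper name `ollama_deployment` and the model name are made up for the example. The resource name is an assumption about the cluster: with the NVIDIA device plugin's single strategy, slices are requested as `nvidia.com/gpu`, while the mixed strategy exposes profile-specific names such as `nvidia.com/mig-1g.10gb`.

```python
import json


def ollama_deployment(model_name, mig_resource="nvidia.com/gpu", mig_count=1):
    """Build a Deployment manifest for one Ollama instance with MIG slices.

    model_name is a hypothetical label and must be a valid DNS-style name.
    mig_resource depends on the cluster's MIG strategy (assumption).
    """
    name = f"ollama-{model_name}"
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": "ollama",
                        "image": "ollama/ollama",
                        "ports": [{"containerPort": 11434}],
                        # The scheduler reserves this many slices for the pod.
                        "resources": {"limits": {mig_resource: mig_count}},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = ollama_deployment("llama3", "nvidia.com/mig-1g.10gb")
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

Requesting more than one slice only reserves them for the pod; whether ollama actually spreads a model across them is the open question in this thread.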
