[GH-ISSUE #7104] Optimizing GPU Usage for AI Models: Splitting Workloads Across Multiple GPUs Even if the Model Fits in One GPU #4511

Closed
opened 2026-04-12 15:26:29 -05:00 by GiteaMirror · 21 comments

Originally created by @varyagnord on GitHub (Oct 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7104

I have a question about how Ollama works and its options for running AI models. If a PC has two GPUs, for example two RTX 3090s, and we launch a model that occupies 20GB of VRAM, it will be loaded onto one card, preferably the fastest one. That means processing the 20GB model falls to roughly 10,500 CUDA cores. Is there an option to divide the model across both GPUs even if it fits on one? For example, if we split the model in half, 10GB would be processed by the 10,500 CUDA cores of the first GPU and the other 10GB by the 10,500 CUDA cores of the second, so 21,000 CUDA cores in total would work on the model. Theoretically, this could improve performance. I understand that the increased data exchange over the PCIe bus might become a bottleneck, but even so, such an approach could be faster. If this option does not exist yet, it might be worth implementing and experimenting with. If it works, then in the future, when using multiple GPUs with different numbers of CUDA cores, models should be divided proportionally to the number of CUDA cores to achieve maximum performance.

GiteaMirror added the feature request label 2026-04-12 15:26:29 -05:00

@rick-github commented on GitHub (Oct 5, 2024):

You can set `OLLAMA_SCHED_SPREAD=1` in the server environment to have ollama spread the model across GPUs rather than use a single one. However, this doesn't speed up inference in most cases because of the serial nature of inference: a model is a stack of layers, and inference on one layer must complete before its results can feed the next layer. For a given completion, you can't have one GPU working on layer x while another GPU works on layer x+n. Multiple GPUs do help when you run multiple parallel completions (see `OLLAMA_NUM_PARALLEL`) or batched completions, where a queue of completions is processed in a pipeline: sequential portions of the model are loaded across the GPUs, so several completions can run concurrently, each in a different portion of the model. All of which is to say that you can split a model across multiple GPUs, but it won't speed up any individual completion.

There are use cases where multiple GPUs can be used to do parallel matrix ops in a single layer, but I don't know if llama.cpp implements that logic.
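A minimal sketch of that setup (the model name `llama3` is illustrative; the environment variables are the ones named above):

```bash
# Spread model layers across all GPUs instead of packing them onto one,
# and allow up to 4 completions to be processed concurrently.
export OLLAMA_SCHED_SPREAD=1
export OLLAMA_NUM_PARALLEL=4

# Start (or restart) the server so the scheduler picks up the settings,
# then load a model; its layers are now distributed across the GPUs.
ollama serve &
ollama run llama3 "Hello"
```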

@varyagnord commented on GitHub (Oct 5, 2024):

Thanks a million.

"There are use cases where multiple GPUs can be used to do parallel matrix ops in a single layer, but I don't know if llama.cpp implements that logic." - This part is also quite interesting. Has this logic been implemented, and is its implementation planned? Might someone have this information...?

@rick-github commented on GitHub (Oct 5, 2024):

I had a look at the PR that implemented multi-GPU support in llama.cpp and it says "Matrix multiplications are split across GPUs and done in parallel", so it sounds like this might be done. Unfortunately I don't have a multi-GPU system to test with.

@varyagnord commented on GitHub (Oct 5, 2024):

I have a system with a 3090 and a 3080 installed. However, I likely need to set some special environment variables to have calculations performed this way.

@rick-github commented on GitHub (Oct 5, 2024):

My understanding is you just need to set `OLLAMA_SCHED_SPREAD=1` in the server environment, restart the server, and then load a model. In the logs you should see the runner started with a `--tensor-split` argument, and you should be good to go.
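A sketch of that procedure on a systemd-based Linux install (the service name `ollama` and the exact log wording are assumptions that may differ on your setup):

```bash
# Add the variable to the ollama service environment and restart it.
# systemctl edit opens a drop-in file; add the Environment line shown.
sudo systemctl edit ollama      # Environment="OLLAMA_SCHED_SPREAD=1"
sudo systemctl restart ollama

# Load a model, then check that the runner was started with --tensor-split.
ollama run llama3 "hi" >/dev/null
journalctl -u ollama --since "5 min ago" | grep -- --tensor-split
```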

@igorschlum commented on GitHub (Oct 5, 2024):

Hi @varyagnord, your question is interesting. I drafted an answer and asked ChatGPT for its advice.

Your answer is close, but the concept could be clarified for better accuracy and precision. Here’s a revised version of your response:

“As I understand it, GPUs already process tasks in parallel using thousands of CUDA cores. While it might seem that splitting the model across two GPUs would improve performance, in most cases, this approach does not necessarily speed up inference. The overhead of synchronizing the data between the GPUs, as well as potential bottlenecks in data transfer over the PCI-e bus, could offset the benefits of using additional CUDA cores.”

Regarding whether you’re right, in general, splitting models across multiple GPUs is typically done for larger models that exceed the VRAM capacity of a single GPU. If a model fits comfortably within the memory of one GPU, distributing it across two GPUs often adds complexity without a significant performance boost. You are correct that GPUs already use parallelism efficiently, but the added data exchange between GPUs can slow things down rather than accelerate them. However, some specialized frameworks may support multi-GPU inference with optimizations to reduce the overhead, though it’s not the default approach.

@varyagnord commented on GitHub (Oct 5, 2024):

> My understanding is you just need to set `OLLAMA_SCHED_SPREAD=1` in the server environment, restart the server, and then load a model. In the logs you should see the runner started with a `--tensor-split` argument, and you should be good to go.

I proceeded in this manner and compared the performance. The number of tokens per second is only slightly lower when splitting a model that requires 6GB of video memory between the two video cards, so each card gets approximately 3GB of layers. The decrease in performance is almost imperceptible, but it does exist; likely the layers residing on the 3080 are processed slightly slower, since the 3080 has fewer CUDA cores. It appears that processing occurs layer by layer, with one card handling its own layers first and then passing control to the other card for its layers, without simultaneous operations on a single layer across both GPUs.

However, theoretically, this approach could increase the speed of parallel requests: if the model is evenly distributed between two cards of similar performance, then while the first card processes the early layers of a second request, the first request can already be running through the layers located on the second card. That way the second card does not sit idle waiting for a single request to reach its layers, but can work on layers from another request, making better use of resources. I hope my thoughts are understandable. )))
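A quick way to exercise that pipelining idea (a sketch; `llama3` and the prompts are illustrative, `/api/generate` is ollama's standard endpoint):

```bash
# With OLLAMA_SCHED_SPREAD=1 and OLLAMA_NUM_PARALLEL >= 2 set on the server,
# fire two completions concurrently and compare against running them serially.
time (
  curl -s http://localhost:11434/api/generate \
    -d '{"model":"llama3","prompt":"Question one","stream":false}' &
  curl -s http://localhost:11434/api/generate \
    -d '{"model":"llama3","prompt":"Question two","stream":false}' &
  wait
)
```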

@rick-github commented on GitHub (Oct 5, 2024):

Yes, this is what I meant by batched completions. A single completion still takes the same amount of time, but you can queue multiple completions, and the average completion time will be inversely proportional to the number of GPUs.

There is a parameter that llama.cpp takes, which you can't set from ollama, that determines how the model is split across GPUs: `--split-mode`. The choices are `row` and `layer`, and they may change the performance. I played around with this some time ago but didn't come to a definite conclusion, and then lost access to the system I was testing with.
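For anyone who wants to experiment outside ollama, a sketch of trying both modes against llama.cpp's own server (flag names per current llama.cpp builds; the model path is illustrative):

```bash
# Split by whole layers (the default): each GPU holds a contiguous slice.
llama-server -m ./model.gguf -ngl 99 --split-mode layer

# Split weight matrices by rows: the GPUs cooperate on each layer,
# at the cost of extra PCIe synchronization per token.
llama-server -m ./model.gguf -ngl 99 --split-mode row
```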

@varyagnord commented on GitHub (Oct 5, 2024):

Thus, I’m theoretically correct in my reasoning :) It remains to test it in practice ))) A huge thank you to everyone for participating in the discussion of this question :)

@leoho0722 commented on GitHub (Mar 19, 2025):

Hi, I am currently using Ollama as the LLM inference backend in a multi-GPU environment, and would like to ask: is there an upper limit on the number of GPUs when using `OLLAMA_SCHED_SPREAD=1` to split across GPUs?

@rick-github commented on GitHub (Mar 19, 2025):

If you set `OLLAMA_SCHED_SPREAD=1`, or the model is too large to fit on a single GPU, ollama will distribute the model evenly* across all available GPUs. Asterisk on "evenly" because if the GPUs are different, there will be some variation. ollama will use a maximum of 16 devices, so 1 CPU + 15 GPUs is the default limit (https://github.com/ollama/ollama/issues/7148).

@leoho0722 commented on GitHub (Mar 19, 2025):

> If you set `OLLAMA_SCHED_SPREAD=1`, or the model is too large to fit on a single GPU, ollama will distribute the model evenly* across all available GPUs. Asterisk on "evenly" because if the GPUs are different, there will be some variation. ollama will use a maximum of 16 devices, so 1 CPU + 15 GPUs is the default limit (#7148).

Thank you for your reply.

Could you clarify what kind of variation to expect if the GPUs are different?

@rick-github commented on GitHub (Mar 19, 2025):

Probably better to say sub-optimally.

If GPUs have different amounts of free VRAM then layer assignment will be affected. A GPU that is more capable than the others in the system will not be prioritized.

For example, say you have a 5090 with 32G and two 3090s with 24G each, and want to load a model that needs 36G. ollama will assign 12G to each GPU, rather than load most of it onto the fastest GPU, i.e. 32G on the 5090 and 4G on a 3090.

In other cases, if the amount of free VRAM on a GPU doesn't meet a threshold (enough to hold a layer, the KV cache, and the graph) then it will be excluded completely, even if the CUDA backend could actually load the required data structures using flash attention.

@leoho0722 commented on GitHub (Mar 21, 2025):

> Probably better to say sub-optimally.
>
> If GPUs have different amounts of free VRAM then layer assignment will be affected. A GPU that is more capable than the others in the system will not be prioritized.
>
> For example, say you have a 5090 with 32G and two 3090s with 24G each, and want to load a model that needs 36G. ollama will assign 12G to each GPU, rather than load most of it onto the fastest GPU, i.e. 32G on the 5090 and 4G on a 3090.
>
> In other cases, if the amount of free VRAM on a GPU doesn't meet a threshold (enough to hold a layer, the KV cache, and the graph) then it will be excluded completely, even if the CUDA backend could actually load the required data structures using flash attention.

Thank you for your reply.

I think this should be helpful to me.

@chrisoutwright commented on GitHub (Jul 6, 2025):

I have an RTX 4090 and an RTX 3090, but I run the 4090 in the second PCIe slot. Regardless of the PCIe slot ordering or any CUDA-visibility configuration I set in my script, I see from MSI Afterburner (screenshot: https://github.com/user-attachments/assets/a8427d3f-113f-48b8-8110-24967a2d6061)

that the 3090 consistently pulls nearly 100 W more power than the other card under load (I would want it the other way round, since the 3090 runs hotter). I can't really explain why that is, given that the workload distribution should be even or, if anything, favor the more powerful 4090.

For context, I'm using a PowerShell script to configure different Ollama nodes with environment variables like `CUDA_VISIBLE_DEVICES`, `OLLAMA_HOST`, etc. Still, the monitoring shows this consistent power-draw difference between the GPUs, and I'm not sure what's causing it.
This will surely impact speed. I may change which card drives the display, or even swap the slots physically, but it seems strange that I can't control this in software. Both GPUs run at PCIe 4.0 x8.
For reference, in gaming the 4090 has no issue drawing more wattage; this seems specific to ollama. I also ran some embedding tasks, and both cards will sustain >300 W simultaneously on my PSU, so it is not about current draw per se.

@rick-github commented on GitHub (Jul 6, 2025):

I'm assuming you want the difference in power draw explained.

I'm not familiar with the MSI Afterburner v4.6.5 hardware monitor, but it looks like the power graphs are clipped at 150 W, with the average for the 3090 being 286.6 W and the average for the 4090 being 198.4 W, while the maximums are 408.2 W and 446.7 W respectively.

As you will recall from the discussion above, ollama distributes layers evenly across devices, so in the case where two devices of the same type are used (e.g. 2×4090), each device consumes half of the power required to generate a token. Where the devices are not the same type, the layers are still distributed evenly, but the most power draw will now come from the device that takes the longest to complete its half of the work. So the 3090 (which I assume is slower than a 4090) spends more time processing the layers allocated to it. While its peak power usage is lower (408 W), it stays at load longer, hence the higher average draw.

@chrisoutwright commented on GitHub (Jul 6, 2025):

> I'm assuming you want the difference in power draw explained.
>
> I'm not familiar with the MSI Afterburner v4.6.5 hardware monitor, but it looks like the power graphs are clipped at 150 W, with the average for the 3090 being 286.6 W and the average for the 4090 being 198.4 W, while the maximums are 408.2 W and 446.7 W respectively.
>
> As you will recall from the discussion above, ollama distributes layers evenly across devices, so in the case where two devices of the same type are used (e.g. 2×4090), each device consumes half of the power required to generate a token. Where the devices are not the same type, the layers are still distributed evenly, but the most power draw will now come from the device that takes the longest to complete its half of the work. So the 3090 (which I assume is slower than a 4090) spends more time processing the layers allocated to it. While its peak power usage is lower (408 W), it stays at load longer, hence the higher average draw.

Would it make sense to change the layer scheduling logic so that layers are allocated proportionally to the processing speed of each GPU, rather than just split evenly? For example, instead of a strict even split, schedule 5 consecutive layers to the faster 4090 and then 2 to the 3090, or use some other dynamic allocation based on measured throughput. If there's enough VRAM headroom, could we split layers proportionally to GPU speed and offload the KV cache to the slower GPU if it's independent, or to the faster GPU if KV handling is itself the bottleneck? With no headroom this might just reproduce the same power-draw difference, but with some headroom it could be better optimized. Is this an avenue that could be worthwhile? Sure, I am using different GPUs, but technically the 4090 should be about 20% faster; it would be good if that could still be exploited. Mixed-GPU setups are not that uncommon among hobbyists.

@rick-github commented on GitHub (Jul 7, 2025):

Older versions of ollama had a hook for this sort of fine-tuning, `--tensor-split`. That's no longer supported, but it's not out of the realm of possibility that similar functionality could be added to the new ollama engine. In the meantime, #10678 and judicious use of device ordering in `CUDA_VISIBLE_DEVICES`/`ROCR_VISIBLE_DEVICES` would allow prioritizing layer assignment to the most powerful device.
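An illustration of that device-ordering workaround (a sketch; the index-to-card mapping below is an assumption, so check `nvidia-smi -L` on your machine first):

```bash
# List CUDA devices to find which index maps to which card.
nvidia-smi -L

# Put the faster card first so layer assignment prioritizes it;
# here we assume the 4090 is device 1 and the 3090 is device 0.
export CUDA_VISIBLE_DEVICES=1,0
ollama serve
```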

@igorschlum commented on GitHub (Jul 7, 2025):

@rick-github Looking ahead, it would be powerful if this feature was developed with distributed computing in mind. The goal would be to allow Ollama to split an LLM's workload across multiple machines (e.g., several Mac Minis), with each computer handling different layers of the model. This would enable users to pool their collective RAM and CPU power, making it possible to run larger, more capable models on more accessible hardware.

@citystrawman commented on GitHub (Oct 24, 2025):

> Thus, I'm theoretically correct in my reasoning :) It remains to test it in practice ))) A huge thank you to everyone for participating in the discussion of this question :)

Hello, I am using ollama on a server with four RTX 4090s. From your answer, may I assume that using multiple GPUs does little to speed up the number of tokens generated per second for a single request, and that multiple GPUs only help when processing multiple models?

@andrewdalpino commented on GitHub (Feb 13, 2026):

What about the KV cache? How is that distributed over multiple GPUs?

Reference: github-starred/ollama#4511