[GH-ISSUE #7104] Optimizing GPU Usage for AI Models: Splitting Workloads Across Multiple GPUs Even if the Model Fits in One GPU #4511

Closed
opened 2026-04-12 15:26:29 -05:00 by GiteaMirror · 21 comments

Originally created by @varyagnord on GitHub (Oct 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7104

I have a question about how Ollama works and its options for running AI models. If a PC has two GPUs, for example two RTX 3090s, and we launch a model that occupies 20GB of VRAM, it will be loaded onto one card, preferably the fastest one. That means processing the 20GB model falls to roughly 10,500 CUDA cores. Is there an option to divide the model across both GPUs even if it fits on one? For example, if we split the model in half, 10GB would be processed by the 10,500 CUDA cores of the first GPU and the other 10GB by the 10,500 CUDA cores of the second, so 21,000 CUDA cores in total would work on the model. Theoretically, this could improve performance. I understand that the increased data exchange over the PCIe bus might become a bottleneck, but even so, such an approach could be faster. If this option does not exist yet, it might be worth implementing and experimenting with. If it works, then in the future, when using multiple GPUs with different numbers of CUDA cores, models should be divided proportionally to the number of CUDA cores to achieve maximum performance.

GiteaMirror added the feature request label 2026-04-12 15:26:29 -05:00

@rick-github commented on GitHub (Oct 5, 2024):

You can set `OLLAMA_SCHED_SPREAD=1` in the server environment to have ollama spread the model across GPUs rather than use a single one. However, this doesn't speed up inference in most cases because of the serial nature of inference: a model is a stack of layers, and inference on one layer must complete before its results can feed the next layer. For a given completion, you can't have one GPU working on layer x while another GPU works on layer x+n. Multiple GPUs do help when you run multiple parallel completions (see `OLLAMA_NUM_PARALLEL`) or batched completions, where a queue of completions is processed in a pipeline: sequential portions of the model are loaded across the GPUs, so several completions can run concurrently, each in a different portion of the model. All of which is to say that you can split a model across multiple GPUs, but it won't speed up any individual completion.

There are use cases where multiple GPUs can be used to do parallel matrix ops in a single layer, but I don't know if llama.cpp implements that logic.
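A minimal sketch of that setup (the model name `llama3` is illustrative; the environment variables are the ones named above):

```bash
# Spread model layers across all GPUs instead of packing them onto one,
# and allow up to 4 completions to be processed concurrently.
export OLLAMA_SCHED_SPREAD=1
export OLLAMA_NUM_PARALLEL=4

# Start (or restart) the server so the scheduler picks up the settings,
# then load a model; its layers are now distributed across the GPUs.
ollama serve &
ollama run llama3 "Hello"
```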

@varyagnord commented on GitHub (Oct 5, 2024):

Thanks a million.

"There are use cases where multiple GPUs can be used to do parallel matrix ops in a single layer, but I don't know if llama.cpp implements that logic." - This part is also quite interesting. Has this logic been implemented, and is its implementation planned? Might someone have this information...?

@rick-github commented on GitHub (Oct 5, 2024):

I had a look at the PR that implemented multi-GPU support in llama.cpp and it says "Matrix multiplications are split across GPUs and done in parallel", so it sounds like this might be done. Unfortunately I don't have a multi-GPU system to test with.

@varyagnord commented on GitHub (Oct 5, 2024):

I have a system with a 3090 and a 3080 installed. However, I likely need to set some special environment variables to have calculations performed this way.

@rick-github commented on GitHub (Oct 5, 2024):

My understanding is you just need to set `OLLAMA_SCHED_SPREAD=1` in the server environment, restart the server, and then load a model. In the logs you should see the runner started with a `--tensor-split` argument, and you should be good to go.
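A sketch of that procedure on a systemd-based Linux install (the service name `ollama` and the exact log wording are assumptions that may differ on your setup):

```bash
# Add the variable to the ollama service environment and restart it.
# systemctl edit opens a drop-in file; add the Environment line shown.
sudo systemctl edit ollama      # Environment="OLLAMA_SCHED_SPREAD=1"
sudo systemctl restart ollama

# Load a model, then check that the runner was started with --tensor-split.
ollama run llama3 "hi" >/dev/null
journalctl -u ollama --since "5 min ago" | grep -- --tensor-split
```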

@igorschlum commented on GitHub (Oct 5, 2024):

Hi @varyagnord, your question is interesting. I drafted an answer and asked ChatGPT for its advice.

Your answer is close, but the concept could be clarified for better accuracy and precision. Here’s a revised version of your response:

“As I understand it, GPUs already process tasks in parallel using thousands of CUDA cores. While it might seem that splitting the model across two GPUs would improve performance, in most cases, this approach does not necessarily speed up inference. The overhead of synchronizing the data between the GPUs, as well as potential bottlenecks in data transfer over the PCI-e bus, could offset the benefits of using additional CUDA cores.”

Regarding whether you’re right, in general, splitting models across multiple GPUs is typically done for larger models that exceed the VRAM capacity of a single GPU. If a model fits comfortably within the memory of one GPU, distributing it across two GPUs often adds complexity without a significant performance boost. You are correct that GPUs already use parallelism efficiently, but the added data exchange between GPUs can slow things down rather than accelerate them. However, some specialized frameworks may support multi-GPU inference with optimizations to reduce the overhead, though it’s not the default approach.

@varyagnord commented on GitHub (Oct 5, 2024):

> My understanding is you just need to set `OLLAMA_SCHED_SPREAD=1` in the server environment, restart the server, and then load a model. In the logs you should see the runner started with a `--tensor-split` argument, and you should be good to go.

I proceeded in this manner and compared the performance. The number of tokens per second is only slightly lower when splitting a model that requires 6GB of video memory between the two video cards, so each card gets approximately 3GB of layers. The decrease in performance is almost imperceptible, but it does exist; likely the layers residing on the 3080 are processed slightly slower, since the 3080 has fewer CUDA cores. It appears that processing occurs layer by layer, with one card handling its own layers first and then passing control to the other card for its layers, without simultaneous operations on a single layer across both GPUs.

However, theoretically, this approach could increase the speed of parallel requests: if the model is evenly distributed between two cards of similar performance, then while the first card processes the early layers of a second request, the first request can already be running through the layers located on the second card. That way the second card does not sit idle waiting for a single request to reach its layers, but can work on layers from another request, making better use of resources. I hope my thoughts are understandable. )))
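A quick way to exercise that pipelining idea (a sketch; `llama3` and the prompts are illustrative, `/api/generate` is ollama's standard endpoint):

```bash
# With OLLAMA_SCHED_SPREAD=1 and OLLAMA_NUM_PARALLEL >= 2 set on the server,
# fire two completions concurrently and compare against running them serially.
time (
  curl -s http://localhost:11434/api/generate \
    -d '{"model":"llama3","prompt":"Question one","stream":false}' &
  curl -s http://localhost:11434/api/generate \
    -d '{"model":"llama3","prompt":"Question two","stream":false}' &
  wait
)
```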

@rick-github commented on GitHub (Oct 5, 2024):

Yes, this is what I meant by batched completions. A single completion still takes the same amount of time, but you can queue multiple completions, and the average completion time will be inversely proportional to the number of GPUs.

There is a parameter that llama.cpp takes, which you can't set from ollama, that determines how the model is split across GPUs: `--split-mode`. The choices are `row` and `layer`, and they may change the performance. I played around with this some time ago but didn't come to a definite conclusion, and then lost access to the system I was testing with.
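For anyone who wants to experiment outside ollama, a sketch of trying both modes against llama.cpp's own server (flag names per current llama.cpp builds; the model path is illustrative):

```bash
# Split by whole layers (the default): each GPU holds a contiguous slice.
llama-server -m ./model.gguf -ngl 99 --split-mode layer

# Split weight matrices by rows: the GPUs cooperate on each layer,
# at the cost of extra PCIe synchronization per token.
llama-server -m ./model.gguf -ngl 99 --split-mode row
```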

@varyagnord commented on GitHub (Oct 5, 2024):

Thus, I’m theoretically correct in my reasoning :) It remains to test it in practice ))) A huge thank you to everyone for participating in the discussion of this question :)

@leoho0722 commented on GitHub (Mar 19, 2025):

Hi, I am currently using Ollama as the LLM inference backend in a multi-GPU environment, and would like to ask: is there an upper limit on the number of GPUs when using `OLLAMA_SCHED_SPREAD=1` to split across GPUs?

@rick-github commented on GitHub (Mar 19, 2025):

If you set `OLLAMA_SCHED_SPREAD=1`, or the model is too large to fit on a single GPU, ollama will distribute the model evenly* across all available GPUs. Asterisk on "evenly" because if the GPUs are different, there will be some variation. ollama will use a maximum of 16 devices, so 1 CPU + 15 GPUs is the default limit (https://github.com/ollama/ollama/issues/7148).

@leoho0722 commented on GitHub (Mar 19, 2025):

> If you set `OLLAMA_SCHED_SPREAD=1`, or the model is too large to fit on a single GPU, ollama will distribute the model evenly* across all available GPUs. Asterisk on "evenly" because if the GPUs are different, there will be some variation. ollama will use a maximum of 16 devices, so 1 CPU + 15 GPUs is the default limit (#7148).

Thank you for your reply.

Could you clarify what kind of variation to expect if the GPUs are different?

@rick-github commented on GitHub (Mar 19, 2025):

Probably better to say sub-optimally.

If GPUs have different amounts of free VRAM then layer assignment will be affected. A GPU that is more capable than the others in the system will not be prioritized.

For example, say you have a 5090 with 32G and two 3090s with 24G each, and want to load a model that needs 36G. ollama will assign 12G to each GPU, rather than load most of it onto the fastest GPU, i.e. 32G on the 5090 and 4G on a 3090.

In other cases, if the amount of free VRAM on a GPU doesn't meet a threshold (enough to hold a layer, the KV cache, and the graph) then it will be excluded completely, even if the CUDA backend could actually load the required data structures using flash attention.

@leoho0722 commented on GitHub (Mar 21, 2025):

> Probably better to say sub-optimally.
>
> If GPUs have different amounts of free VRAM then layer assignment will be affected. A GPU that is more capable than the others in the system will not be prioritized.
>
> For example, say you have a 5090 with 32G and two 3090s with 24G each, and want to load a model that needs 36G. ollama will assign 12G to each GPU, rather than load most of it onto the fastest GPU, i.e. 32G on the 5090 and 4G on a 3090.
>
> In other cases, if the amount of free VRAM on a GPU doesn't meet a threshold (enough to hold a layer, the KV cache, and the graph) then it will be excluded completely, even if the CUDA backend could actually load the required data structures using flash attention.

Thank you for your reply.

I think this should be helpful to me.

@chrisoutwright commented on GitHub (Jul 6, 2025):

I have an RTX 4090 and an RTX 3090, but I run the 4090 in the second PCIe slot. Regardless of the PCIe slot ordering or any CUDA-visibility configuration I set in my script, I see from MSI Afterburner (screenshot: https://github.com/user-attachments/assets/a8427d3f-113f-48b8-8110-24967a2d6061)

that the 3090 consistently pulls nearly 100 W more power than the other card under load (I would want it the other way round, since the 3090 runs hotter). I can't really explain why that is, given that the workload distribution should be even or, if anything, favor the more powerful 4090.

For context, I'm using a PowerShell script to configure different Ollama nodes with environment variables like `CUDA_VISIBLE_DEVICES`, `OLLAMA_HOST`, etc. Still, the monitoring shows this consistent power-draw difference between the GPUs, and I'm not sure what's causing it.
This will surely impact speed. I may change which card drives the display, or even swap the slots physically, but it seems strange that I can't control this in software. Both GPUs run at PCIe 4.0 x8.
For reference, in gaming the 4090 has no issue drawing more wattage; this seems specific to ollama. I also ran some embedding tasks, and both cards will sustain >300 W simultaneously on my PSU, so it is not about current draw per se.

@rick-github commented on GitHub (Jul 6, 2025):

I'm assuming you want the difference in power draw explained.

I'm not familiar with the MSI Afterburner v4.6.5 hardware monitor, but it looks like the power graphs are clipped at 150 W, with the average for the 3090 being 286.6 W and the average for the 4090 being 198.4 W, while the maximums are 408.2 W and 446.7 W respectively.

As you will recall from the discussion above, ollama distributes layers evenly across devices, so in the case where two devices of the same type are used (e.g. 2×4090), each device consumes half of the power required to generate a token. Where the devices are not the same type, the layers are still distributed evenly, but the most power draw will now come from the device that takes the longest to complete its half of the work. So the 3090 (which I assume is slower than a 4090) spends more time processing the layers allocated to it. While its peak power usage is lower (408 W), it stays at load longer, hence the higher average draw.

@chrisoutwright commented on GitHub (Jul 6, 2025):

> I'm assuming you want the difference in power draw explained.
>
> I'm not familiar with the MSI Afterburner v4.6.5 hardware monitor, but it looks like the power graphs are clipped at 150 W, with the average for the 3090 being 286.6 W and the average for the 4090 being 198.4 W, while the maximums are 408.2 W and 446.7 W respectively.
>
> As you will recall from the discussion above, ollama distributes layers evenly across devices, so in the case where two devices of the same type are used (e.g. 2×4090), each device consumes half of the power required to generate a token. Where the devices are not the same type, the layers are still distributed evenly, but the most power draw will now come from the device that takes the longest to complete its half of the work. So the 3090 (which I assume is slower than a 4090) spends more time processing the layers allocated to it. While its peak power usage is lower (408 W), it stays at load longer, hence the higher average draw.

Would it make sense to change the layer scheduling logic so that layers are allocated proportionally to the processing speed of each GPU, rather than just split evenly? For example, instead of a strict even split, schedule 5 consecutive layers to the faster 4090 and then 2 to the 3090, or use some other dynamic allocation based on measured throughput. If there's enough VRAM headroom, could we split layers proportionally to GPU speed and offload the KV cache to the slower GPU if it's independent, or to the faster GPU if KV handling is itself the bottleneck? With no headroom this might just reproduce the same power-draw difference, but with some headroom it could be better optimized. Is this an avenue that could be worthwhile? Sure, I am using different GPUs, but technically the 4090 should be about 20% faster; it would be good if that could still be exploited. Mixed-GPU setups are not that uncommon among hobbyists.

@rick-github commented on GitHub (Jul 7, 2025):

Older versions of ollama had a hook for this sort of fine-tuning, `--tensor-split`. That's no longer supported, but it's not out of the realm of possibility that similar functionality could be added to the new ollama engine. In the meantime, #10678 and judicious use of device ordering in `CUDA_VISIBLE_DEVICES`/`ROCR_VISIBLE_DEVICES` would allow prioritizing layer assignment to the most powerful device.
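An illustration of that device-ordering workaround (a sketch; the index-to-card mapping below is an assumption, so check `nvidia-smi -L` on your machine first):

```bash
# List CUDA devices to find which index maps to which card.
nvidia-smi -L

# Put the faster card first so layer assignment prioritizes it;
# here we assume the 4090 is device 1 and the 3090 is device 0.
export CUDA_VISIBLE_DEVICES=1,0
ollama serve
```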

@igorschlum commented on GitHub (Jul 7, 2025):

@rick-github Looking ahead, it would be powerful if this feature was developed with distributed computing in mind. The goal would be to allow Ollama to split an LLM's workload across multiple machines (e.g., several Mac Minis), with each computer handling different layers of the model. This would enable users to pool their collective RAM and CPU power, making it possible to run larger, more capable models on more accessible hardware.

@citystrawman commented on GitHub (Oct 24, 2025):

> Thus, I'm theoretically correct in my reasoning :) It remains to test it in practice ))) A huge thank you to everyone for participating in the discussion of this question :)

Hello, I am using ollama on a server with four RTX 4090s. From your answer, may I assume that using multiple GPUs does little to speed up the number of tokens generated per second for a single request, and that multiple GPUs only help when processing multiple models?

@andrewdalpino commented on GitHub (Feb 13, 2026):

What about the KV cache? How is that distributed over multiple GPUs?

Reference: github-starred/ollama#4511