[GH-ISSUE #2672] Does Ollama support multiple GPUs working simultaneously? #1588

Closed
opened 2026-04-12 11:31:04 -05:00 by GiteaMirror · 18 comments

Originally created by @papandadj on GitHub (Feb 22, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2672

I have 8 RTX 4090 GPUs. Can they support a 70B-int4 parameter model?


@luoluoter commented on GitHub (Feb 22, 2024):

You can try and see what happens.

Just run it; I think they can handle a 70B-int4 model.

On my Linux machine with multiple GPUs, Ollama uses all of them natively.

By the way, I do not want to use all of the GPUs, but I don't know how to do that for now. 😂


@aaronnewsome commented on GitHub (Feb 22, 2024):

I have 3x 4060Ti. Didn't do anything special and Ollama uses all GPUs.


@papandadj commented on GitHub (Feb 23, 2024):

Thanks, I will try it.


@lowstz commented on GitHub (Feb 23, 2024):

I have 4x 2080Ti 22G and it runs very well; the model is split across multiple GPUs.
ref: https://x.com/lowstz/status/1758855507551633716

~~Ollama's backend llama.cpp does not support concurrent processing, so you can run 3 instances of 70b-int4 on 8x RTX 4090 and set up a haproxy/nginx load balancer in front of the Ollama API to improve throughput.~~

Ollama 0.2 and later versions already have concurrency support.
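(Hedged aside: on Ollama 0.2 and later, concurrency can be tuned directly with the environment variables documented in the FAQ rather than by running extra instances behind a load balancer. The values below are placeholders, not recommendations.)

```bash
# Sketch only: tune Ollama 0.2+ concurrency via environment variables.
export OLLAMA_NUM_PARALLEL=4         # concurrent requests per loaded model
export OLLAMA_MAX_LOADED_MODELS=2    # models kept loaded at the same time
ollama serve
```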


@papandadj commented on GitHub (Feb 26, 2024):

I tested it. The latest qwen72b-chat-q4_0 is supported; it can use all 8 GPUs, with each GPU using roughly 7 GB.


@ICHarmony commented on GitHub (Mar 9, 2024):

> I have 8 RTX 4090 GPUs. Can they support a 70B-int4 parameter model?

What chassis are you using to house them?


@twmht commented on GitHub (Mar 28, 2024):

@papandadj

Does ollama support qwen72b? How?


@ICHarmony commented on GitHub (Mar 28, 2024):

> @papandadj
>
> Does ollama support qwen72b? How?

If you have over 23 GB of VRAM, then yes. If a model is available as .gguf or PyTorch/safetensors files, you can use it:

https://github.com/ollama/ollama/blob/main/docs/import.md

https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GGUF/tree/main

https://medium.com/@sudarshan-koirala/ollama-huggingface-8e8bc55ce572

https://github.com/ollama/ollama/blob/main/docs/modelfile.md

TIP: The stop parameter is important, and Ollama will copy the .gguf when it imports the model, so make sure you have enough storage space.
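A minimal sketch of the import flow described above, assuming a locally downloaded GGUF; the file name and stop tokens are illustrative (check the Qwen model card for the exact chat template):

```bash
# Sketch only: import a downloaded GGUF into Ollama via a Modelfile.
cat > Modelfile <<'EOF'
FROM ./qwen1.5-72b-chat-q4_k_m.gguf
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
EOF

ollama create qwen72b -f Modelfile   # copies the .gguf into Ollama's model store
ollama run qwen72b
```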


@MTDickens commented on GitHub (Dec 24, 2024):

> You can try and see what happens.
>
> Just run it; I think they can handle a 70B-int4 model.
>
> On my Linux machine with multiple GPUs, Ollama uses all of them natively.
>
> By the way, I do not want to use all of the GPUs, but I don't know how to do that for now. 😂

Use the `CUDA_VISIBLE_DEVICES` environment variable. You can easily find usage examples via Google.
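For example, a rough sketch that exposes only two of the GPUs to the Ollama server (the indices are examples; see `nvidia-smi -L` for yours):

```bash
# Sketch only: let the Ollama server see GPUs 0 and 1, hiding the rest.
CUDA_VISIBLE_DEVICES=0,1 ollama serve
```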


@accqaz commented on GitHub (Jan 10, 2025):

> I tested it. The latest qwen72b-chat-q4_0 is supported; it can use all 8 GPUs, with each GPU using roughly 7 GB.

Hello! Is there a command to set this up? I have two A6000s, and only one GPU is detected when I run `ollama serve`. I want it to use both GPUs, because inference feels especially slow with just one; I suspect it is because loading the model already takes over 40 GB, which then slows inference down. By the way, I pulled the qwen2.5-72b model. Any advice would be appreciated!


@vnicolici commented on GitHub (Jan 28, 2025):

I have a question. I have a 4090 with 24 GB of VRAM, and when I use models with large contexts, for example 128k, Ollama refuses to use the GPU at all; it runs 100% on the CPU. From what I understand this is normal: some context-related data structures can't be split between the CPU and the GPU, and that prevents it from using the GPU when I set the context limit too high.

But what would happen if I had 2 GPUs with 24 GB of VRAM each? Would I be able to run larger context sizes, with both GPUs used, in situations where I couldn't before with just one GPU? Would it let me double the context size without falling back to CPU-only, compared to the context sizes supported with a single GPU?

My concern is that those context-related data structures might need to fit entirely on a single GPU and can't be split between 2 GPUs, so even having 2 GPUs might not change the behavior compared to a single GPU, and it might still use just the CPU in that situation.
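One hedged way to check where a given context size ends up is to request it through the API and then look at the processor split; the model name and `num_ctx` value below are only examples:

```bash
# Sketch only: load a model with a large context, then check how much of it
# was offloaded to the GPU(s).
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b",
  "prompt": "hello",
  "options": { "num_ctx": 131072 }
}'

# The PROCESSOR column reports the split, e.g. "100% GPU" or "40%/60% CPU/GPU".
ollama ps
```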


@jithinagiboson commented on GitHub (Feb 5, 2025):

Same for me: I tried to run qwen2.5 14b with a 22k context, and Ollama refused to use my 2x 4090s and just used 100% CPU.


@NoskyOrg commented on GitHub (Feb 10, 2025):

> Same for me: I tried to run qwen2.5 14b with a 22k context, and Ollama refused to use my 2x 4090s and just used 100% CPU.

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-load-models-on-multiple-gpus

From `ollama serve --help`:
OLLAMA_SCHED_SPREAD: Always schedule model across all GPUs

```ini
# Force the model to be scheduled across all GPUs. By default the scheduler
# picks a GPU based on load, so it may end up using only a single card.
Environment="OLLAMA_SCHED_SPREAD=1"
```

@NoskyOrg commented on GitHub (Feb 10, 2025):

> > I tested it. The latest qwen72b-chat-q4_0 is supported; it can use all 8 GPUs, with each GPU using roughly 7 GB.
>
> Hello! Is there a command to set this up? I have two A6000s, and only one GPU is detected when I run `ollama serve`. I want it to use both GPUs, because inference feels especially slow with just one; I suspect it is because loading the model already takes over 40 GB, which then slows inference down. By the way, I pulled the qwen2.5-72b model. Any advice would be appreciated!

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-load-models-on-multiple-gpus

> When loading a new model, Ollama evaluates the model's VRAM requirements against the VRAM currently available. If the model fits entirely on any single GPU, Ollama loads it on that GPU. This typically provides the best performance, as it reduces the amount of data transferred across the PCI bus during inference. If the model does not fit entirely on one GPU, it is spread across all available GPUs.

From `ollama serve --help`:
OLLAMA_SCHED_SPREAD: Always schedule model across all GPUs

```ini
# Force the model to be scheduled across all GPUs. By default the scheduler
# picks a GPU based on load, so it may end up using only a single card.
Environment="OLLAMA_SCHED_SPREAD=1"
```

@newcrane commented on GitHub (Mar 1, 2025):

> > I tested it. The latest qwen72b-chat-q4_0 is supported; it can use all 8 GPUs, with each GPU using roughly 7 GB.
>
> Hello! Is there a command to set this up? I have two A6000s, and only one GPU is detected when I run `ollama serve`. I want it to use both GPUs, because inference feels especially slow with just one; I suspect it is because loading the model already takes over 40 GB, which then slows inference down. By the way, I pulled the qwen2.5-72b model. Any advice would be appreciated!

Environment variables:
Environment="CUDA_VISIBLE_DEVICES=0,1" lets Ollama see GPUs 0 and 1
Environment="OLLAMA_SCHED_SPREAD=1" spreads the load evenly across those cards

Restart the service:
systemctl daemon-reload
systemctl restart ollama.service
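A sketch of where those `Environment=` lines go, assuming the standard systemd-managed install (the drop-in is created with `systemctl edit`, as described in the Ollama FAQ):

```bash
# Sketch only: add the variables to a systemd drop-in for the ollama service.
sudo systemctl edit ollama.service
# In the editor that opens, add (values are examples):
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=0,1"
#   Environment="OLLAMA_SCHED_SPREAD=1"
sudo systemctl daemon-reload
sudo systemctl restart ollama.service
```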

@yosgith commented on GitHub (Aug 4, 2025):

With LLMStudio you can enable and disable GPUs at will.


@weathon commented on GitHub (Oct 10, 2025):

Environment="OLLAMA_SCHED_SPREAD=1"

This works for me thanks


@citystrawman commented on GitHub (Oct 24, 2025):

> > > I tested it. The latest qwen72b-chat-q4_0 is supported; it can use all 8 GPUs, with each GPU using roughly 7 GB.
> >
> > Hello! Is there a command to set this up? I have two A6000s, and only one GPU is detected when I run `ollama serve`. I want it to use both GPUs, because inference feels especially slow with just one; I suspect it is because loading the model already takes over 40 GB, which then slows inference down. By the way, I pulled the qwen2.5-72b model. Any advice would be appreciated!
>
> Environment variables:
> Environment="CUDA_VISIBLE_DEVICES=0,1" lets Ollama see GPUs 0 and 1
> Environment="OLLAMA_SCHED_SPREAD=1" spreads the load evenly across those cards
>
> Restart the service:
> systemctl daemon-reload
> systemctl restart ollama.service

Hello, I am testing deepseek-r1:70b through RAGFlow. The server has 4x 4090 GPUs, each with 48 GB of VRAM, but when I monitor with nvidia-smi only one GPU is working, and inference feels quite slow. I am running Ollama in a Docker container; how should I configure it to speed up inference?
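When Ollama runs in Docker, a rough sketch of passing the same variable into the container (image, volume, and port follow the usual Ollama Docker defaults; `OLLAMA_SCHED_SPREAD` is the variable discussed above):

```bash
# Sketch only: run the Ollama container with all GPUs visible and force the
# scheduler to spread the model across them.
docker run -d --gpus=all \
  -e OLLAMA_SCHED_SPREAD=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```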