[GH-ISSUE #7479] Is there a way to make ollama load the model entirely onto the GPU by default every time it is called? #51266

GiteaMirror · 2026-04-28T19:06:52-05:00

GiteaMirror commented

2026-04-28 19:06:52 -05:00

Originally created by @fg2501 on GitHub (Nov 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7479

What is the issue?

![fe3abf5e-4911-4713-86c8-8f669ba9f838](https://github.com/user-attachments/assets/03427162-b4e6-4af9-ba3f-a46c94899a5c)

When I call a model, the GPU is often not fully used: sometimes the load is split half CPU, half GPU, and sometimes it runs entirely on the CPU. Is there a way to force GPU-only execution?
Also, a model loaded onto the GPU is unloaded after 5 minutes by default. Can I change that to 10 minutes, or keep the model loaded permanently?

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.1.29

GiteaMirror added the question label 2026-04-28 19:06:52 -05:00

GiteaMirror closed this issue

2026-04-28 19:07:02 -05:00

GiteaMirror commented

2026-04-28 19:07:03 -05:00

@rick-github commented on GitHub (Nov 3, 2024):

Set [`OLLAMA_KEEP_ALIVE=-1`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately) to stop the model from being unloaded.

ollama uses as much of the GPU as it can. If the GPU is full, part of the model will be run in CPU. If you want to run a model only in GPU, use a smaller model or get a bigger GPU.
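For the "10 minutes instead of 5" part of the question, a minimal sketch: besides the server-wide `OLLAMA_KEEP_ALIVE` variable, the Ollama API accepts a per-request `keep_alive` field (a duration string such as `"10m"`, a number of seconds, or a negative number to keep the model loaded). The model name and the helper names below are placeholders, and the default address `localhost:11434` is an assumption about the local setup.

```python
import json
import urllib.request

# Assumption: the server listens on the default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_keep_alive_request(model: str, keep_alive) -> dict:
    """Build a prompt-less /api/generate body that loads `model` and sets its keep-alive.

    keep_alive may be a duration string such as "10m", a number of seconds,
    or a negative number (e.g. -1) to keep the model loaded indefinitely.
    """
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def send(body: dict, url: str = OLLAMA_URL) -> dict:
    """POST the body to a running Ollama server and return the decoded JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# "llama3.2" is a placeholder model name, not one taken from this thread.
ten_minutes = build_keep_alive_request("llama3.2", "10m")
forever = build_keep_alive_request("llama3.2", -1)
```

Calling `send(forever)` against a running server should preload the model and leave it resident; `ollama ps` can then be used to check the reported expiry.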

GiteaMirror commented

2026-04-28 19:07:04 -05:00

@fg2501 commented on GitHub (Nov 4, 2024):

> Set `OLLAMA_KEEP_ALIVE=-1` to stop the model from being unloaded.
>
> ollama uses as much of the GPU as it can. If the GPU is full, part of the model will be run in CPU. If you want to run a model only in GPU, use a smaller model or get a bigger GPU.

OK, thank you very much! The point is that even when the GPU is not full, ollama sometimes still runs on the CPU only. If it really did use the GPU as much as possible, I would not be asking.

GiteaMirror commented

2026-04-28 19:07:04 -05:00

@rick-github commented on GitHub (Nov 4, 2024):

ollama uses as much of the GPU as it can.
whatever doesn't fit in the GPU will run on the CPU.
a model is a series of layers. the layers are processed sequentially.
operations on the GPU will run faster than operations on the CPU.
because operations on GPU are faster, more time is spent in CPU operations.
the GPU is idle while waiting for the CPU operations to complete.
ollama calls the GPU as much as it can, but the bottleneck is the CPU.
if you want the GPU to be used 100%, use a model that fits in the GPU.

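The reasoning above can be put into numbers with a small sketch. The per-layer timings are made up for illustration (they are not measurements from this thread), and the model assumes layers run strictly one after another, as described.

```python
def gpu_busy_fraction(gpu_layers: int, cpu_layers: int,
                      gpu_ms_per_layer: float, cpu_ms_per_layer: float) -> float:
    """Fraction of one token's wall time during which the GPU is working,
    assuming layers are processed sequentially."""
    gpu_time = gpu_layers * gpu_ms_per_layer
    cpu_time = cpu_layers * cpu_ms_per_layer
    return gpu_time / (gpu_time + cpu_time)


# Hypothetical numbers: 40 layers, CPU layers 10x slower than GPU layers.
split = gpu_busy_fraction(30, 10, 1.0, 10.0)  # 30 ms on GPU, 100 ms on CPU
full = gpu_busy_fraction(40, 0, 1.0, 10.0)    # everything fits on the GPU

print(f"30/40 layers on GPU: GPU busy {split:.0%} of the time")
print(f"40/40 layers on GPU: GPU busy {full:.0%} of the time")
```

With these assumed timings, offloading three quarters of the layers still leaves the GPU idle for most of each token, which is why the monitor shows well under 100% utilization.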

GiteaMirror commented

2026-04-28 19:07:05 -05:00

@fg2501 commented on GitHub (Nov 6, 2024):

> ollama uses as much of the GPU as it can. whatever doesn't fit in the GPU will run on the CPU. a model is a series of layers. the layers are processed sequentially. operations on the GPU will run faster than operations on the CPU. because operations on GPU are faster, more time is spent in CPU operations. the GPU is idle while waiting for the CPU operations to complete. ollama calls the GPU as much as it can, but the bottleneck is the CPU. if you want the GPU to be used 100%, use a model that fits in the GPU.

I noticed something: I have a 4B model whose model file is only 8 GB, yet when it is loaded it shows a size of 39 GB. What causes this? The model is originally called minicpm3, from ModelBest (面壁智能).

GiteaMirror commented

2026-04-28 19:07:06 -05:00

@fg2501 commented on GitHub (Nov 6, 2024):

![aa28b43a-fa54-4521-bdc6-d1f67bf7028e](https://github.com/user-attachments/assets/e3f8af1c-da92-4c84-8f4c-e37d8f462fc1)

GiteaMirror commented

2026-04-28 19:07:07 -05:00

@fg2501 commented on GitHub (Nov 6, 2024):

It just clicked: the context length I set was probably too long. Thanks, I understand this one now.

GiteaMirror commented

2026-04-28 19:07:07 -05:00

@fg2501 commented on GitHub (Nov 6, 2024):

I just checked again, and it is not a context-length problem. Never mind, I am done with it; I have already uninstalled this model.

GiteaMirror commented

2026-04-28 19:07:08 -05:00

@morika546 commented on GitHub (Nov 6, 2024):

I ran into a similar situation, but on an AMD card, a 6800 XT. It has 16 GB of dedicated VRAM, yet with 14 GB of total VRAM in use it insists on loading 9 GB into shared GPU memory and only 5 GB into dedicated VRAM, which makes it very slow. Did switching models solve the problem for you?

GiteaMirror commented

2026-04-28 19:07:08 -05:00

@fg2501 commented on GitHub (Nov 7, 2024):

> I ran into a similar situation, but on an AMD card, a 6800 XT. It has 16 GB of dedicated VRAM, yet with 14 GB of total VRAM in use it insists on loading 9 GB into shared GPU memory and only 5 GB into dedicated VRAM, which makes it very slow. Did switching models solve the problem for you?

No, it is not solved; it still behaves the same way.

GiteaMirror commented

2026-04-28 19:07:09 -05:00

@yzwou commented on GitHub (Nov 11, 2024):

> Set [`OLLAMA_KEEP_ALIVE=-1`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately) to stop the model from being unloaded.
>
> ollama uses as much of the GPU as it can. If the GPU is full, part of the model will be run in CPU. If you want to run a model only in GPU, use a smaller model or get a bigger GPU.

I ran `set OLLAMA_KEEP_ALIVE=-1` in PowerShell, but it has no effect, even after restarting. How can I fix this?
`ollama ps` still shows `4 minutes from now` instead of `forever`.

GiteaMirror commented

2026-04-28 19:07:10 -05:00

@rick-github commented on GitHub (Nov 11, 2024):

https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-windows

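Some background on why the command above had no effect: in PowerShell, `set` is an alias for `Set-Variable`, so `set OLLAMA_KEEP_ALIVE=-1` does not create an environment variable, and an Ollama server that is already running would not see a variable set later in some other shell anyway. The sketch below shows the general idea of starting the server with the variable in its environment; the helper names are illustrative, and on Windows the linked FAQ section remains the reference procedure.

```python
import os
import subprocess


def server_env(overrides: dict, base=None) -> dict:
    """Return a copy of the environment with the given Ollama settings applied."""
    env = dict(os.environ if base is None else base)
    env.update({key: str(value) for key, value in overrides.items()})
    return env


def start_server(overrides: dict) -> subprocess.Popen:
    """Start `ollama serve` as a child process that sees the overridden variables.

    Any Ollama instance that is already running (for example the tray app)
    has to be stopped first, otherwise the port is already in use.
    """
    return subprocess.Popen(["ollama", "serve"], env=server_env(overrides))


# Demonstration with a tiny stand-in base environment instead of os.environ.
env = server_env({"OLLAMA_KEEP_ALIVE": -1}, base={"PATH": "/usr/bin"})
```

After the server is started this way and a model is loaded, `ollama ps` is expected to report `forever` rather than a countdown.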

GiteaMirror commented

2026-04-28 19:07:12 -05:00

@yzwou commented on GitHub (Nov 11, 2024):

> https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-windows

Thanks.

GiteaMirror commented

2026-04-28 19:07:14 -05:00

@sudocodus commented on GitHub (Dec 1, 2024):

> I ran into a similar situation, but on an AMD card, a 6800 XT. It has 16 GB of dedicated VRAM, yet with 14 GB of total VRAM in use it insists on loading 9 GB into shared GPU memory and only 5 GB into dedicated VRAM, which makes it very slow. Did switching models solve the problem for you?

Has your problem been solved?

GiteaMirror commented

2026-04-28 19:07:15 -05:00

@fjzphch commented on GitHub (Jan 12, 2025):

> I ran into a similar situation, but on an AMD card, a 6800 XT. It has 16 GB of dedicated VRAM, yet with 14 GB of total VRAM in use it insists on loading 9 GB into shared GPU memory and only 5 GB into dedicated VRAM, which makes it very slow. Did switching models solve the problem for you?

I found a solution. If you use Open WebUI, you can change the value of `num_gpu (Ollama)` under the advanced parameters. I am on a 6750 GRE 12 GB with a 9B model; setting it to around 30 makes it very fast, and going higher or lower does not work.
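Outside Open WebUI, the same knob can be passed per request: the Ollama API takes an `options` object, and `num_gpu` (the number of layers to place on the GPU) is one of its fields. A minimal sketch, with a placeholder model name and illustrative helper names; the candidate values simply bracket the 30 reported above.

```python
def build_generate_request(model: str, prompt: str, num_gpu: int) -> dict:
    """Body for POST /api/generate that asks for `num_gpu` layers on the GPU."""
    if num_gpu < 0:
        raise ValueError("num_gpu must be zero or positive")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},
    }


# "my-9b-model" is a placeholder; 30 is the value reported in the comment above.
body = build_generate_request("my-9b-model", "Hello", 30)

# Candidate values to time against each other when tuning for a specific card.
candidates = [build_generate_request("my-9b-model", "Hello", n) for n in range(26, 35, 2)]
```

Since the comment reports that both higher and lower values were slower, timing a few candidates on the actual hardware is probably the only reliable way to pick a value.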

GiteaMirror commented

2026-04-28 19:07:15 -05:00

@Angel0726 commented on GitHub (Feb 7, 2025):

`ollama ps` shows 100% GPU. But when I run `nvidia-smi` to check usage, GPU utilization stays at 0 while CPU usage is at 50%. The model runs at a mediocre speed, so is the GPU actually being used?
GPU: A100 40G
Model: deep-seek:32B

GiteaMirror commented

2026-04-28 19:07:16 -05:00

@hhwilliam commented on GitHub (Feb 8, 2025):

Ubuntu 20.04, V100 GPU, driver version 440, CUDA version 10.2. When running the deepseek-r1:7B model, `nvidia-smi` shows 0% GPU utilization and no process using GPU resources. The token generation rate is acceptable and `ollama ps` shows 100% GPU, but `uptime` shows the CPU load spiking. How can I force it to use the GPU?
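One way to cross-check what `ollama ps` reports against the driver's view is to sample `nvidia-smi` in its CSV query mode while a request is generating. The sketch below only parses that output; the sample text is hand-written for illustration and is not output from any machine in this thread, and the 100 MiB threshold is an arbitrary assumption.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text: str) -> list:
    """Parse nvidia-smi CSV rows into dicts (utilization in %, memory in MiB)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return rows


def read_gpus() -> list:
    """Run nvidia-smi once and return the parsed rows (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_rows(out)


# Hand-written sample: GPU 0 looks untouched, GPU 1 holds a model and is busy.
sample = "0, 0, 3\n1, 97, 21504\n"
untouched = [g["index"] for g in parse_gpu_rows(sample) if g["mem_mib"] < 100]
```

If every GPU stays near zero memory use while tokens are being generated, the weights are most likely not on the GPU at all, whatever `ollama ps` says, which matches the installation problems reported later in this thread.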

GiteaMirror commented

2026-04-28 19:07:17 -05:00

@Angel0726 commented on GitHub (Feb 8, 2025):

> Ubuntu 20.04, V100 GPU, driver version 440, CUDA version 10.2. When running the deepseek-r1:7B model, `nvidia-smi` shows 0% GPU utilization and no process using GPU resources. The token generation rate is acceptable and `ollama ps` shows 100% GPU, but `uptime` shows the CPU load spiking. How can I force it to use the GPU?

Found the problem. My issue was a faulty ollama installation; it had been installed offline.

GiteaMirror commented

2026-04-28 19:07:21 -05:00

@lindsaymorgan commented on GitHub (Feb 8, 2025):

> > Ubuntu 20.04, V100 GPU, driver version 440, CUDA version 10.2. When running the deepseek-r1:7B model, `nvidia-smi` shows 0% GPU utilization and no process using GPU resources. The token generation rate is acceptable and `ollama ps` shows 100% GPU, but `uptime` shows the CPU load spiking. How can I force it to use the GPU?
>
> Found the problem. My issue was a faulty ollama installation; it had been installed offline.

Could you share how you fixed it in the end? I ran into the same problem.

GiteaMirror commented

2026-04-28 19:07:24 -05:00

@ZongXR commented on GitHub (Feb 8, 2025):

> > Ubuntu 20.04, V100 GPU, driver version 440, CUDA version 10.2. When running the deepseek-r1:7B model, `nvidia-smi` shows 0% GPU utilization and no process using GPU resources. The token generation rate is acceptable and `ollama ps` shows 100% GPU, but `uptime` shows the CPU load spiking. How can I force it to use the GPU?
>
> Found the problem. My issue was a faulty ollama installation; it had been installed offline.

How did you solve it?

GiteaMirror commented

2026-04-28 19:07:29 -05:00

@lumos0 commented on GitHub (Feb 13, 2025):

Same problem here; not solved yet.

GiteaMirror commented

2026-04-28 19:07:31 -05:00

@Hugo-san commented on GitHub (Feb 14, 2025):

> Same problem here; not solved yet.

I hit this same problem, and updating ollama from 0.5.7 to 0.5.10 solved it for me.

GiteaMirror commented

2026-04-28 19:07:32 -05:00

@lumos0 commented on GitHub (Feb 14, 2025):

> Same problem here; not solved yet.

The cause was indeed the installation.
Reinstalling ollama fixed it; the previous installation had been done offline. For the reinstall, I followed the [online install](https://ollama.com/install.sh) script, manually modified the parts that need network access, and after installing again the problem was gone.

GiteaMirror commented

2026-04-28 19:07:33 -05:00

@bindianzhiyan commented on GitHub (Feb 17, 2025):

![Image](https://github.com/user-attachments/assets/f3d16e10-f105-434d-ab2f-a08ba10b6ed9)

Has anyone else run into this? The load has already been configured.

GiteaMirror commented

2026-04-28 19:07:34 -05:00

@rick-github commented on GitHub (Feb 17, 2025):

Upgrade ollama.


GiteaMirror commented

2026-04-28 19:07:35 -05:00

@weeee4 commented on GitHub (Feb 18, 2025):

Ubuntu 20.04, T4 GPU, driver version 440, CUDA version 10.2. When running the deepseek-r1:7B model, `nvidia-smi` shows 0% GPU utilization and no process using GPU resources, and `ollama ps` shows 100% CPU. Reinstalling ollama does not fix it either. How should I handle this?

GiteaMirror commented

2026-04-28 19:07:37 -05:00

@rick-github commented on GitHub (Feb 18, 2025):

Open a new ticket, add [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues).

GiteaMirror commented

2026-04-28 19:07:39 -05:00

@sunt1009 commented on GitHub (Feb 19, 2025):

ollama was installed offline; the server has no internet access, so an offline installation is the only option. There are currently four A40 cards, but at runtime only one card is used, at 98%, while the other three sit at 0%. ollama version: 0.5.11. Has anyone run into this kind of problem?

GiteaMirror commented

2026-04-28 19:07:40 -05:00

@cxlGiraffe commented on GitHub (Feb 19, 2025):

> > Same problem here; not solved yet.
>
> The cause was indeed the installation. Reinstalling ollama fixed it; the previous installation had been done offline. For the reinstall, I followed the [online install](https://ollama.com/install.sh) script, manually modified the parts that need network access, and after installing again the problem was gone.

Could you share how you did it?

GiteaMirror commented

2026-04-28 19:07:41 -05:00

@rick-github commented on GitHub (Feb 19, 2025):

@sunt1009 This is normal. If the model fits on one GPU, only one GPU is used. There is no performance advantage using multiple GPUs for a single completion, see [here](https://github.com/ollama/ollama/issues/7648#issuecomment-2473561990).

GiteaMirror commented

2026-04-28 19:07:42 -05:00

@sunt1009 commented on GitHub (Feb 21, 2025):

@rick-github What I see now is that with a dozen or so people accessing it, one GPU shows 100% and the rest are idle, so only 1-2 people can get through and everyone else is left waiting. That cannot be normal! My expectation is that once one GPU is above 98% utilization, later requests should in theory go to the idle GPUs; that would be the reasonable behavior.

GiteaMirror commented

2026-04-28 19:07:42 -05:00

@rick-github commented on GitHub (Feb 21, 2025):

Set `OLLAMA_NUM_PARALLEL` to as many concurrent requests as you want to handle. Or, if the model fits on one GPU, use multiple servers as I already pointed out [here](https://github.com/ollama/ollama/issues/7648#issuecomment-2473561990).
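A rough sketch of what "multiple servers" could look like on a four-GPU machine: one `ollama serve` process per GPU, each pinned with `CUDA_VISIBLE_DEVICES` and bound to its own port through `OLLAMA_HOST`, with requests spread across them. The port numbering, the round-robin choice, and the helper names are assumptions for illustration; the linked comment's exact setup is not reproduced here.

```python
import os


def per_gpu_server_envs(gpu_ids, num_parallel: int = 1,
                        base_port: int = 11434, base=None) -> list:
    """One environment per GPU: each server sees a single GPU and its own port."""
    base_env = dict(os.environ if base is None else base)
    envs = []
    for offset, gpu in enumerate(gpu_ids):
        env = dict(base_env)
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # pin this server to one GPU
        env["OLLAMA_HOST"] = f"127.0.0.1:{base_port + offset}"
        env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
        envs.append(env)
    return envs


def pick_server(request_number: int, envs: list) -> str:
    """Round-robin choice of the server address for the n-th request."""
    return envs[request_number % len(envs)]["OLLAMA_HOST"]


# Four GPUs, four parallel slots each; an empty base keeps the demo readable.
envs = per_gpu_server_envs([0, 1, 2, 3], num_parallel=4, base={})
# Each env would be passed to subprocess.Popen(["ollama", "serve"], env=env).
```

In practice a reverse proxy or load balancer would usually take the place of `pick_server`; the point is only that each server owns one GPU, so concurrent users are no longer queued behind a single card.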

GiteaMirror commented

2026-04-28 19:07:43 -05:00

@fg2501 commented on GitHub (Mar 3, 2025):

> I ran into a similar situation, but on an AMD card, a 6800 XT. It has 16 GB of dedicated VRAM, yet with 14 GB of total VRAM in use it insists on loading 9 GB into shared GPU memory and only 5 GB into dedicated VRAM, which makes it very slow. Did switching models solve the problem for you?

I suddenly solved it today: add a parameter in the configuration file that specifies how many layers to load onto the GPU, `PARAMETER num_gpu 100`. I set it this way for everything now, and it generally loads the whole model onto the GPU. Of course, if your model is very large, you will need to increase this layer count further.
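`PARAMETER` is Modelfile syntax, so the "configuration file" here is presumably a Modelfile from which a derived model is created with `ollama create`. A minimal sketch under that assumption; the base model and the new model name are placeholders, not names from this thread.

```python
def modelfile_with_num_gpu(base_model: str, num_gpu: int) -> str:
    """Return Modelfile text deriving a model from `base_model` with a fixed num_gpu."""
    return f"FROM {base_model}\nPARAMETER num_gpu {num_gpu}\n"


def create_command(new_name: str, modelfile_path: str) -> list:
    """Argument list for `ollama create`, suitable for subprocess.run."""
    return ["ollama", "create", new_name, "-f", modelfile_path]


# Placeholder names; substitute a model that is already pulled locally.
text = modelfile_with_num_gpu("llama3.2", 100)
cmd = create_command("llama3.2-gpu", "Modelfile")
```

Writing `text` to a file named `Modelfile` and running `cmd` would produce a model that requests 100 GPU layers on every load. If the requested layers do not fit in VRAM, loading may fail or spill over, so this is a knob to experiment with rather than a guaranteed fix.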

GiteaMirror commented

2026-04-28 19:07:44 -05:00

@CalunVier commented on GitHub (May 10, 2025):

> > I ran into a similar situation, but on an AMD card, a 6800 XT. It has 16 GB of dedicated VRAM, yet with 14 GB of total VRAM in use it insists on loading 9 GB into shared GPU memory and only 5 GB into dedicated VRAM, which makes it very slow. Did switching models solve the problem for you?
>
> I suddenly solved it today: add a parameter in the configuration file that specifies how many layers to load onto the GPU, `PARAMETER num_gpu 100`. I set it this way for everything now, and it generally loads the whole model onto the GPU. Of course, if your model is very large, you will need to increase this layer count further.

Hello, could you explain this in a bit more detail?

GiteaMirror commented

2026-04-28 19:07:45 -05:00

@LZHLZHOOO commented on GitHub (Sep 14, 2025):

> > I ran into a similar situation, but on an AMD card, a 6800 XT. It has 16 GB of dedicated VRAM, yet with 14 GB of total VRAM in use it insists on loading 9 GB into shared GPU memory and only 5 GB into dedicated VRAM, which makes it very slow. Did switching models solve the problem for you?
>
> I suddenly solved it today: add a parameter in the configuration file that specifies how many layers to load onto the GPU, `PARAMETER num_gpu 100`. I set it this way for everything now, and it generally loads the whole model onto the GPU. Of course, if your model is very large, you will need to increase this layer count further.

I ran into the same problem. Which configuration file exactly needs to be modified?


Reference: github-starred/ollama#51266