[GH-ISSUE #11772] use cpu to offload moe weights to reduce the VRAM usage. #69860

Open
opened 2026-05-04 19:35:53 -05:00 by GiteaMirror · 31 comments

Originally created by @Readon on GitHub (Aug 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11772

ggml-org/llama.cpp#15077 already supports keeping a model's MoE layers in CPU memory to reduce VRAM usage.
How about enabling this in Ollama via a strategy setting, perhaps defined by a variable such as OLLAMA_MOE_OFFLOAD set to one of FULL, PARTIAL, or NONE,
where PARTIAL means offloading as much as possible.
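
For illustration only, a minimal Go sketch of how such a setting could be parsed into a strategy; OLLAMA_MOE_OFFLOAD and the MoEOffload type here are assumptions taken from this proposal, not existing Ollama code:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// MoEOffload is a hypothetical strategy value for keeping MoE expert weights in CPU memory.
type MoEOffload int

const (
	MoEOffloadNone    MoEOffload = iota // keep today's behaviour: no special handling of experts
	MoEOffloadPartial                   // offload only as many experts as needed to fit VRAM
	MoEOffloadFull                      // keep all expert weights in system RAM
)

// moeOffloadFromEnv reads the proposed OLLAMA_MOE_OFFLOAD variable and defaults to NONE.
func moeOffloadFromEnv() MoEOffload {
	switch strings.ToUpper(os.Getenv("OLLAMA_MOE_OFFLOAD")) {
	case "FULL":
		return MoEOffloadFull
	case "PARTIAL":
		return MoEOffloadPartial
	default:
		return MoEOffloadNone
	}
}

func main() {
	os.Setenv("OLLAMA_MOE_OFFLOAD", "PARTIAL")
	fmt.Println("MoE offload strategy:", moeOffloadFromEnv()) // prints 1 (MoEOffloadPartial)
}
```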

GiteaMirror added the feature request label 2026-05-04 19:35:53 -05:00

@ghost commented on GitHub (Aug 7, 2025):

  • +1. I would rather use an integer instead of full/partial/none, etc.; it gives more fine-grained control. This is exactly what stopped me from using Ollama and all those nice MoE models.

@mike-fischer-ml commented on GitHub (Aug 10, 2025):

This would be really helpful for running MoE models.


@draplater commented on GitHub (Aug 13, 2025):

+1


@hg0428 commented on GitHub (Aug 15, 2025):

+1


@coder543 commented on GitHub (Aug 17, 2025):

I would like to provide some concrete numbers here.

System specs: AMD R9 7950X + 64GB RAM + RTX 3090.

Using ollama with gpt-oss:120b and 16k context, I get about 8.5 tok/s average.

Using llama.cpp with this command:

```bash
$ ./llama-server \
    -m gpt-oss-120b-F16.gguf \
    -c 16384 -ngl 999 \
    --flash-attn \
    --cont-batching \
    --jinja \
    --n-cpu-moe 24
```

(Note: 'F16' is just what unsloth calls the MXFP4 model weights for some reason. The file is 61GB.)

I get this outcome:

```
prompt eval time =    5987.98 ms /    73 tokens (   82.03 ms per token,    12.19 tokens per second)
       eval time =   31971.01 ms /   940 tokens (   34.01 ms per token,    29.40 tokens per second)
      total time =   37958.99 ms /  1013 tokens
```

This is nearly 3.5x the performance, for the same model using the same quantization.


@abotsis commented on GitHub (Aug 18, 2025):

What about exposing the parameter via a Modelfile? I’d prefer to be able to configure it per-model rather than globally for everything. The number of expert layers is also model-specific, so it makes sense to put it there if you plan on running more than one model.


@TheSpaceGod commented on GitHub (Aug 21, 2025):

This would be a huge performance benefit for people running smaller MoE models on consumer GPUs with 16GB of VRAM or less. It would (hopefully) make running ~30B MoE models more reasonable, or gpt-oss:20b with a larger context window. I have been searching for any possible performance hacks for running that model on one consumer GPU, and I think this might be the best possible one.


@nkuhn-vmw commented on GitHub (Aug 22, 2025):

+1 - this would be extremely beneficial as MoE models become more and more popular.


@TheSpaceGod commented on GitHub (Aug 22, 2025):

Hi @jmorganca,

This feature could be HUGE for many Ollama users running on client GPUs with 8-16GB of VRAM. Please consider the merits of this ticket, especially since gpt-oss seems to be the new premier Ollama model and its smallest variant is 20B. This becomes increasingly important as open-source agentic coding tools like continue.dev call for larger model sizes.

Thanks!


@LarsKort commented on GitHub (Sep 1, 2025):

+1
This is a very useful feature, and it would be perfect if we had full layer configuration control.
In my case I have a CMP 90HX + 4x CMP 102-100, and I want to move as many attention layers as possible to the CMP 90HX (as it is much faster than the others).

Current numbers on a 2600X + 32GB dual-channel DDR4-3200 + (CMP 90HX + 4x CMP 102-100) setup with gpt-oss:120b:
```
total duration:       8.60550693s
load duration:        243.442116ms
prompt eval count:    68 token(s)
prompt eval duration: 2.868437188s
prompt eval rate:     23.71 tokens/s
eval count:           51 token(s)
eval duration:        5.49283703s
eval rate:            9.28 tokens/s
```

Thank you ollama team for your work!


@mcheninfotech commented on GitHub (Sep 16, 2025):

+1: please add this feature into Ollama!


@asmelko commented on GitHub (Sep 18, 2025):

+1


@inforithmics commented on GitHub (Oct 12, 2025):

I played around a little with the Ollama source code and managed to offload all experts to the CPU by adding the following patch.

```diff
diff --git a/ml/backend/ggml/ggml.go b/ml/backend/ggml/ggml.go
index 07e55dd3..4ec63aa2 100644
--- a/ml/backend/ggml/ggml.go
+++ b/ml/backend/ggml/ggml.go
@@ -323,6 +323,10 @@ func New(modelPath string, params ml.BackendParams) (ml.Backend, error) {
 					target: "blk." + strconv.Itoa(i) + "." + t.Name,
 				}, layer.bts, i)
 			}
+		case strings.Contains(t.Name, "_exps"):
+			slog.Info("loading expert tensor on CPU", "tensor", t.Name)
+			// MoE expert weights can be very large; keep them on CPU to save GPU memory
+			createTensor(tensor{source: t}, input.bts, -1)
 		default:
 			layerIndex := -1
 			if fields := strings.FieldsFunc(t.Name, func(r rune) bool { return !unicode.IsNumber(r) }); len(fields) > 0 {
-- 
```

[0001-all-experts-on-CPU.patch](https://github.com/user-attachments/files/22871507/0001-all-experts-on-CPU.patch)

@jessegross, is this the right location to control where the tensor is loaded in the Ollama engine?

Things to do:

  • Make it configurable (a rough sketch follows below).
  • Adapt the memory usage calculation and layer offloading logic.
  • Load as many experts as possible onto the GPU when all layers are already on the GPU.
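
One purely illustrative way the first item could look: gate the case above on a hypothetical OLLAMA_NUM_CPU_MOE count (mirroring llama.cpp's --n-cpu-moe), so only the expert tensors of the first N blocks stay on the CPU. Neither the variable nor these helpers exist in Ollama today, and the tensor name layout is assumed for the example:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// numCPUMoE reads a hypothetical OLLAMA_NUM_CPU_MOE setting: how many blocks'
// expert tensors should stay in system RAM (0 disables the special casing).
func numCPUMoE() int {
	n, err := strconv.Atoi(os.Getenv("OLLAMA_NUM_CPU_MOE"))
	if err != nil || n < 0 {
		return 0
	}
	return n
}

// keepExpertsOnCPU reports whether a tensor name (assumed here to look like
// "blk.<index>.ffn_up_exps") is an expert weight in one of the first numCPUMoE blocks.
func keepExpertsOnCPU(name string) bool {
	if !strings.Contains(name, "_exps") {
		return false
	}
	fields := strings.Split(name, ".")
	if len(fields) < 2 || fields[0] != "blk" {
		return false
	}
	idx, err := strconv.Atoi(fields[1])
	return err == nil && idx < numCPUMoE()
}

func main() {
	os.Setenv("OLLAMA_NUM_CPU_MOE", "24")
	fmt.Println(keepExpertsOnCPU("blk.7.ffn_up_exps"))   // true: expert tensor, block 7 < 24
	fmt.Println(keepExpertsOnCPU("blk.30.ffn_up_exps"))  // false: block 30 >= 24
	fmt.Println(keepExpertsOnCPU("blk.7.attn_q.weight")) // false: not an expert tensor
}
```

The case in the patch could then be gated with something like `case keepExpertsOnCPU(t.Name):` instead of matching every `_exps` tensor unconditionally.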

@TheSpaceGod commented on GitHub (Oct 13, 2025):

Does this problem get solved in part for MoE models like Qwen3 now running on the new Ollama LLM engine? I recently updated my Ollama version to the latest Docker release, and my Qwen3 models seem to be using less VRAM with the same context window than they have in the past. And when I expand the context window past the VRAM limits and some layers spill to the CPU, it no longer seems to totally nuke performance (I get ~30% GPU utilization instead of near ~0%). I don't know what specific changes have happened in roughly the last month, but MoE models seem to be running way better on the Ollama LLM engine.


@jessegross commented on GitHub (Oct 14, 2025):

@inforithmics Yes, that looks like the right place.


@inforithmics commented on GitHub (Oct 15, 2025):

I hard-coded the MoE offloading to see if it is useful.
For integrated GPUs it was a performance regression; it may only be useful on systems where there is a big performance gap between the GPU and the CPU.
With MoE offload: 19 tokens per second (all layers on GPU; 36 expert layers on GPU and 13 expert layers on CPU).
Without MoE offload: 24 tokens per second (38 layers on GPU, 11 layers on CPU).


@mlgitter commented on GitHub (Nov 6, 2025):

@inforithmics Does it behave the same way with llama.cpp?

Maybe the feature deserves some configuration, like in llama.cpp?


@hg0428 commented on GitHub (Dec 24, 2025):

Any update on this?


@v8v8v commented on GitHub (Feb 5, 2026):

+1 from me, this would be a big win for users.
Would be amazing to have this built in someday.


@resynth commented on GitHub (Feb 18, 2026):

+1, as I get more than 3x the throughput using `--n-cpu-moe` in koboldcpp and LM Studio compared to Ollama without it.
I have 16GB of VRAM, which makes running Qwen3 Coder Next 80B marginal at around 10 t/s generation.
Setting full GPU offload and then tweaking `--n-cpu-moe` gives me 32 t/s generation!


@jimb0bb commented on GitHub (Feb 24, 2026):

+1. I can get up to a 5.3x speed-up with 1.5x the context using llama.cpp with proper MoE CPU offload, running gpt-oss:20b on my 4070 + Ryzen 7600: 50 tk/s with a 64k context window as opposed to 9.5 tk/s with a 35k context window.


@TTDiang2 commented on GitHub (Feb 27, 2026):

+1


@MarkMuravev commented on GitHub (Mar 6, 2026):

+1


@ljlabs commented on GitHub (Mar 20, 2026):

+1


@mohaljifri commented on GitHub (Mar 22, 2026):

+1

This feature is critical as more models adopt MoE. A lot of coders have 8GB of VRAM, and even high-end GPUs max out at around 32GB per card, which is not enough to load large models in full. Also, the current percentage-based CPU/GPU split does not load the model as its developer intended, so enabling MoE offloading becomes a must for using an intelligent model for coding.

@Readon opened pull request #12333 implementing it as a parameter.

Can we reconsider the pull request and merge it as an experimental feature?


@redstefan1 commented on GitHub (Mar 23, 2026):

Fully agree. Allowing the user to offload some or all of the MoE layers to system RAM is a great way to run larger, smarter models on older or mobile GPUs that could normally only handle a few billion parameters at best, making local AI coding tools or agents like openclaw a lot more viable for people who rely on older hardware or laptops.

(Especially considering that upgrading system RAM is a whole lot easier than upgrading VRAM, making the barrier to entry for turning your old gaming or workstation PC into a local AI server that much easier to cross.)

> +1
>
> This feature is critical as more models adopt MoE. A lot of coders have 8GB of VRAM, and even high-end GPUs max out at around 32GB per card, which is not enough to load large models in full. Also, the current percentage-based CPU/GPU split does not load the model as its developer intended, so enabling MoE offloading becomes a must for using an intelligent model for coding.
>
> [@Readon](https://github.com/Readon) opened pull request [#12333](https://github.com/ollama/ollama/pull/12333) implementing it as a parameter.
>
> Can we reconsider the pull request and merge it as an experimental feature?


@kokroo commented on GitHub (Mar 26, 2026):

It's hilarious that a simple flag is not being allowed. This would allow power users to configure it themselves.

+1


@sf1tzp commented on GitHub (Apr 2, 2026):

I took a stab at this based on the feedback in #12333. However, on my 12GB 3060, I did not get any better performance compared to main. Probably because I have no idea what I'm doing really. But the idea is interesting! I'd love to squeeze some more performance out of that card if possible.


@redstefan1 commented on GitHub (Apr 27, 2026):

Has anybody managed to make any progress on this?
It seems to me like the basic foundation of this feature already exists in #12333; just the automatic configuration based on available VRAM doesn't. [(Assuming I understood the main point of this comment correctly.)](https://github.com/ollama/ollama/pull/12333#issuecomment-3308548339)

I wonder if the straightforward solution of "load as many layers onto the GPU as possible, the rest gets offloaded to the CPU" works? (A rough sketch of that idea follows below.)

Anybody got the time and willingness to implement that sort of strategy? (I would try it myself, but sadly I do not have the time to get familiar with Ollama's source code and give it a shot.)
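
As a rough illustration of that greedy strategy, here is a minimal sketch; it is not Ollama's actual scheduler, and it assumes per-tensor sizes and a free-VRAM budget are already known. The type names and figures are made up for the example:

```go
package main

import "fmt"

// expertTensor is a hypothetical description of one MoE expert weight tensor.
type expertTensor struct {
	name  string
	bytes uint64
}

// placeExperts greedily assigns expert tensors to the GPU until the free-VRAM
// budget is exhausted; everything that does not fit stays in system RAM.
// Illustrative only, not Ollama's real offloading logic.
func placeExperts(experts []expertTensor, freeVRAM uint64) (gpu, cpu []expertTensor) {
	for _, e := range experts {
		if e.bytes <= freeVRAM {
			gpu = append(gpu, e)
			freeVRAM -= e.bytes
		} else {
			cpu = append(cpu, e)
		}
	}
	return gpu, cpu
}

func main() {
	experts := []expertTensor{
		{"blk.0.ffn_gate_exps", 900 << 20},
		{"blk.1.ffn_gate_exps", 900 << 20},
		{"blk.2.ffn_gate_exps", 900 << 20},
	}
	// Assume ~2GB of VRAM is left over after the dense layers are placed.
	gpu, cpu := placeExperts(experts, 2<<30)
	fmt.Printf("GPU: %d experts, CPU: %d experts\n", len(gpu), len(cpu))
}
```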


@Menooker commented on GitHub (May 2, 2026):

I ended up using llama.cpp and giving up on Ollama. Ollama is great for the out-of-the-box experience and extremely easy to use, but I am increasingly disappointed as the Ollama team keeps ignoring the features that really matter and puts resources into fancy features like agent integration for codex and openclaw, or cloud models. Features like MoE CPU offloading are what I really care about, since gemma4 and qwen3.6 were recently released with MoE. I believe many users share the same feeling, as most users are on commodity GPUs with no more than 16GB of VRAM. This issue has been pending for months and the community even has a PR for it, but the team still won't land it!


@resynth commented on GitHub (May 2, 2026):

I've also left Ollama in favour of more performant alternatives, ik_llama.cpp being the best in my opinion.
I originally used Ollama because it was the most recommended for ease of use, but llama.cpp / ik_llama.cpp are just as easy, faster, have more active development and better tool calling, and concentrate on running LLMs locally rather than the cloud direction Ollama is going in.
