[GH-ISSUE #10030] Deepseek R1, 671b is faster than 70b #68634

Open
opened 2026-05-04 14:39:48 -05:00 by GiteaMirror · 30 comments

Originally created by @fanlessfan on GitHub (Mar 28, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10030

Hello,

I tried running deepseek-r1 CPU-only on a dual Xeon 6138 system with 768GB of memory. The result: 671b (1.74 t/s) is faster than 70b (1.53 t/s), even though the 671b model takes longer overall. I also tried the r1-1776 671b-q8 (713GB) model, which runs at 1.29 t/s, not that much slower.

Could anyone explain it?

thx

![Image](https://github.com/user-attachments/assets/29a58daa-a829-4810-bd9c-968781e1a1eb)

![Image](https://github.com/user-attachments/assets/93f9001c-6f46-488d-adb1-f52b0a8614cb)

![Image](https://github.com/user-attachments/assets/653c150b-b438-4588-8754-2fa69ec15268)

GiteaMirror added the performance label 2026-05-04 14:39:48 -05:00

@rick-github commented on GitHub (Mar 28, 2025):

Multi CPU/NUMA systems can suffer from thread contention (#2936, #8074, #10022). Try reducing the [thread count](https://github.com/ollama/ollama/issues/9857#issuecomment-2758868781) and see if performance changes.
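
A quick way to experiment with this (a hedged sketch; the value 20, one thread per physical core of a single Xeon 6138 socket, is only an illustration) is to pass `num_thread` as a request option:

```sh
# Override the runner's thread count for a single request and compare eval rates.
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:671b",
  "prompt": "Why is the sky blue?",
  "options": { "num_thread": 20 }
}'
```

The same knob is available interactively via `/set parameter num_thread 20` inside `ollama run`.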

@fanlessfan commented on GitHub (Mar 28, 2025):

I think something else is affecting the result. The 70b model is the distilled Llama 70b, and plain Llama 70b runs at the same speed for me as the DeepSeek-distilled 70b, but the 671b model is DeepSeek itself, and it is faster. Also, the DeepSeek 713GB model should be around half the speed of the DeepSeek 404GB model, yet it is only about 25% slower (1.29/1.74 ≈ 75%).

Can anyone explain this?

@fighter3005 commented on GitHub (Mar 28, 2025):

I don't know about Q4 vs Q8, but the 70B should be slower than the 671B, since the 671B only has about 38B active parameters. The 671B obviously takes much longer to load. Maybe it took longer overall because it used more tokens for thinking?

When I test llama3.2-vision 11B Q4 vs Q8 I also see roughly a 25% uplift. I don't know whether that is normal, though. (RTX 3090)

@fanlessfan commented on GitHub (Mar 28, 2025):

Is there any benefit to the 70b over the 671b other than loading faster and using less memory?

@fighter3005 commented on GitHub (Mar 29, 2025):

Again, not an expert, but since the 671B model is huge, it does not fit on many systems. In that case the 70B would be beneficial, especially if you don't have a few hundred GB of VRAM lying around. But since you can fit the 671B in memory, I guess there is no reason to use the 70B model, other than running multiple instances or maybe a longer context, etc.

@NGC13009 commented on GitHub (Mar 29, 2025):

The 70b is a distilled Llama, so the GPU has to genuinely perform the full 70b worth of computation.
The 671b is DeepSeek's MoE architecture: only about 37b parameters are activated per token during inference. So although the model is huge, the per-token compute is roughly that of a 37b model, and you should compare it against a dense model of about that size, such as qwen:32b; the speeds will be in the same ballpark.

@navr32 commented on GitHub (Mar 29, 2025):

Do you have AVX512 enabled?

@fanlessfan commented on GitHub (Mar 29, 2025):

@navr32 I just installed the standard ollama package on Ubuntu Server. Is there a command I can use to check for AVX512?

@rick-github commented on GitHub (Mar 29, 2025):

    lscpu | grep avx512

@fanlessfan commented on GitHub (Mar 29, 2025):

@navr32 Does the output below mean AVX512 is enabled? Thanks.

    $ lscpu | grep avx512
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke md_clear flush_l1d arch_capabilities
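
For what it's worth, the presence of `avx512f` (plus `avx512dq`, `avx512cd`, `avx512bw`, `avx512vl`) in that flag list means AVX-512 is supported and enabled. A quicker filter, assuming a standard Linux `/proc` layout:

```sh
# Print only the AVX-512 feature flags the kernel reports, deduplicated.
grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u
```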

@fanlessfan commented on GitHub (Mar 29, 2025):

@NGC13009 How do you know that about 37b parameters are activated per token during inference? qwen (qwq:32b-q4_K_M) gives 2.83 t/s, and r1-1776:671b-q4_K_M gives 1.8 t/s.

@mrdg-sys commented on GitHub (Apr 3, 2025):

Hi,

I have some insight regarding your dual CPU system, and a few tricks to speed up CPU-only inference.
My system is very similar to yours: dual 6138 Xeons with AVX512 and 385GB of RAM, no GPU. My RAM configuration is chosen for maximum memory throughput: one DIMM per memory channel, 12 DIMMs in total (2 CPUs × 6 memory channels per CPU).

With this configuration I get an average of 3 tokens/second for CPU-only inference. This is the best result we can achieve without NUMA support in Ollama; if that support is added later, perhaps we can reach 5 t/s with our dual CPU systems.

I did try another memory configuration where I filled every available DIMM slot on the motherboard, doubling my total RAM, but it resulted in slower inference (down to 2 t/s) because the RAM speed drops from 2666MHz to 2133MHz when all slots are populated.
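
A rough sketch of why DIMM population matters here (assuming 6 channels per socket and 8 bytes per transfer; these are theoretical peaks, sustained bandwidth is lower):

```sh
# Theoretical per-socket peak bandwidth: channels x transfer rate x 8 bytes.
awk 'BEGIN {
  printf "2666 MT/s (1 DIMM per channel):  %.0f GB/s per socket\n", 6 * 2666 * 8 / 1000
  printf "2133 MT/s (2 DIMMs per channel): %.0f GB/s per socket\n", 6 * 2133 * 8 / 1000
}'
```

Since CPU token generation is essentially memory-bandwidth bound, that ~20% drop in peak bandwidth is consistent with the slower inference seen with all slots populated.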

@fanlessfan commented on GitHub (Apr 4, 2025):

Hi @mrdg-sys,

I think we have exactly the same config, except I have 64GB × 12 = 768GB of RAM, on an X11DPH-T motherboard. How did you get 3 tokens/s?

Here is my memory bandwidth (Intel MLC output):

    Intel(R) Memory Latency Checker - v3.11b
    *** Unable to modify prefetchers (try executing 'modprobe msr')
    *** So, enabling random access for latency measurements
    Measuring idle latencies for random access (in ns)...
                    Numa node
    Numa node            0       1
           0          88.5   142.1
           1         142.2    84.1

    Measuring Peak Injection Memory Bandwidths for the system
    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using all the threads from each core if Hyper-threading is enabled
    Using traffic with the following read-write ratios
    ALL Reads        :  216944.3
    3:1 Reads-Writes :  204116.6
    2:1 Reads-Writes :  203703.3
    1:1 Reads-Writes :  190054.8
    Stream-triad like:  178900.1

    Measuring Memory Bandwidths between nodes within system
    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using all the threads from each core if Hyper-threading is enabled
    Using Read-only traffic type
                    Numa node
    Numa node            0          1
           0        108828.8    51099.7
           1         51098.2   108531.9

    Measuring Loaded Latencies for the system
    Using all the threads from each core if Hyper-threading is enabled
    Using Read-only traffic type
    Inject   Latency  Bandwidth
    Delay    (ns)     MB/sec
    ==========================
     00000   200.80   217166.8
     00002   201.15   217055.5
     00008   201.34   217105.4
     00015   200.94   216880.5
     00050   199.42   214537.3
     00100   142.04   192391.5
     00200   110.08   120579.6
     00300   104.14    83600.2
     00400    97.67    64017.2
     00500    95.84    51905.7
     00700    93.87    37740.0
     01000    92.05    26926.5
     01300    91.15    20951.5
     01700    90.39    16274.2
     02500    89.70    11345.9
     03500    89.33     8325.9
     05000    88.99     6057.5
     09000    88.94     3694.7
     20000    87.68     2068.5

thx
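
As a rough sanity check of these numbers (assumptions: ~37B active parameters per token for the 671b MoE and roughly 0.6 bytes per weight for a q4_K_M quant), decode speed is bounded by how fast the active weights can be streamed from RAM:

```sh
# Crude upper bound on CPU decode speed: memory bandwidth / bytes read per token.
awk 'BEGIN {
  bytes_per_token = 37e9 * 0.6            # ~22 GB of weights touched per token
  printf "at 217 GB/s (measured peak): %.1f t/s\n", 217e9 / bytes_per_token
  printf "at 108 GB/s (single node):   %.1f t/s\n", 108e9 / bytes_per_token
}'
```

Real runs land well below these bounds because of cross-node traffic, latency, and compute overhead, but the arithmetic shows why the MoE 671b can keep pace with, or beat, a dense 70b on the same memory system.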

@mrdg-sys commented on GitHub (Apr 4, 2025):

I think I can already spot your issue; it's in your memory bandwidth result:

    ALL Reads : 216944.3

To achieve that number, your motherboard's BIOS must have NUMA nodes enabled (the default), which maximizes memory bandwidth by splitting the workload between DIMMs and CPU cores... but this configuration actually slows down CPU LLM inference. To improve things, you need to disable all NUMA settings in your motherboard BIOS.

My server is a Fujitsu RX2530 M4 and its BIOS lets me disable NUMA outright. In your case you may need to change NUMA from a 2-way to a 1-way or 0-way configuration if a disable option isn't available.

After reconfiguring the BIOS, run another memory test and look for roughly half the memory bandwidth, something like ALL Reads ~110000.

With that result you can be sure that NUMA is now disabled.

Even though your memory throughput is (in theory) now halved, actual LLM inference will speed up because the CPU cores have unrestricted access to all available memory instead of being clustered per node.

Also, I leave hyperthreading enabled so I can do other tasks on the server while running LLM inference.
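
As a software-level alternative to the BIOS change (a hedged suggestion; it requires the `numactl` package and results vary by workload), Linux can interleave the model's pages across both sockets:

```sh
# Inspect the NUMA topology first: node sizes and inter-node distances.
numactl --hardware

# Restart the server with its allocations interleaved across both nodes
# instead of being placed on the first-touch node.
sudo systemctl stop ollama
numactl --interleave=all ollama serve
```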

@mrdg-sys commented on GitHub (Apr 4, 2025):

> Is there any benefit to the 70b over the 671b other than loading faster and using less memory?

By the way, have you tried the latest qwq 32B Q4 (20GB) model?

I've experimented with many LLMs, large and small, and so far they all give inconsistent results when asked a control question such as:

How many R letters are in the word strawberry?

The answer varies between 2 and 3, which makes them unreliable. It's important to ask control questions after a series of LLM answers to see whether the model is still on track... or has lost its mind.

The latest qwq 32B model has so far always been consistent: the answer is always 3, which is correct.

The reason I bring this up is that this model is small compared to the others and easily fits into GPU memory for fast inference. I see little point in dealing with large 70B and 671B models that are very slow and give inconsistent results when you can work with a quick, tiny one and get correct answers.

Give it a try.

@NGC13009 commented on GitHub (Apr 4, 2025):

> @NGC13009 How do you know that about 37b parameters are activated per token during inference? qwen (qwq:32b-q4_K_M) gives 2.83 t/s, and r1-1776:671b-q4_K_M gives 1.8 t/s.

The DeepSeek paper provides this figure: each expert is about 4B parameters and 8 are activated at a time, giving 32B, plus some computation outside the expert layers, for roughly 37b in total.

That said, real inference won't exactly match the speed of a ~32b model, because the actual speed depends on which GPUs you use and how the deployment is set up (how multi-GPU parallelism is optimized). As a rule of thumb, 20+ t/s on GPUs is normal, and reaching 3 t/s on CPU is already very good. These are just figures from my own experience, for reference only.
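
Mirroring the commenter's figures (these are approximations, not exact values from the model card):

```sh
# 8 activated experts x ~4B parameters each, plus ~5B of attention/shared
# weights that run for every token regardless of expert routing.
awk 'BEGIN { print 8 * 4 + 5, "B parameters active per token (approx.)" }'
```

This is why a ~32-37B dense model, not the full 671B, is the right yardstick for per-token speed.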

@fanlessfan commented on GitHub (Apr 4, 2025):

@NGC13009 Thanks for sharing your experience.

@fanlessfan commented on GitHub (Apr 4, 2025):

Hi @rick-github,

I tried disabling NUMA and performance got slightly worse. My memory bandwidth only dropped to 150GB/s rather than the expected ~100GB/s. Which model did you run to get 3 t/s? With 384GB of memory you can't run the deepseek-r1:671b model.

I tried the qwen 20GB model and got 2.5 t/s with NUMA disabled, versus 2.8 t/s with NUMA enabled.

@mrdg-sys commented on GitHub (Apr 4, 2025):

When you run ollama, leave the inference settings at their defaults and don't force a specific number of threads. By default ollama runs one thread per CPU core, and modifying that thread count will decrease performance.

Due to my RAM limitation, my 671B model of choice is the 2.51-bit quant, only 220GB.

@mrdg-sys commented on GitHub (Apr 4, 2025):

" I tried qwen 20GB and got 2.5t/s. I got 2.8t/s without NUMA disable. "

Yes, thats normal because any small model that fits into single dimm module will benefit from NUMA enabled, however large models that span across many dimm modules do better with NUMA disabled.

NUMA = Non Uniform Memory Access (enabled)
UMA = Uniform Memory Access (disabled)

@fanlessfan commented on GitHub (Apr 4, 2025):

For 671b-q4 I got 1.95 t/s with NUMA disabled and 2.24 t/s with it enabled. I also monitored real-time memory bandwidth with Intel PCM: it was around 80GB/s with NUMA enabled and dropped to 40GB/s with NUMA disabled.

Did you make the 2.51 quant yourself? Ollama only provides the 671b (404GB) model.

Yes, I leave the ollama settings at their defaults. I did try different thread counts on my i7-13700K (8 P-cores, 8 E-cores, 24 threads) and found that 16 threads performs better than the default. I didn't try the 671b model on that machine since it only has 96GB of RAM. Is there any way to verify the thread count while ollama is running?
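
On the last question, one hedged way to check on a Linux host (the model runner is a child process of the same `ollama` binary, so its thread count shows up under NLWP):

```sh
# NLWP = number of threads per process; the runner should show roughly
# num_thread worker threads plus a few housekeeping threads.
ps -o pid,nlwp,args -C ollama
```

The runner's startup line in the server log (`journalctl -u ollama`) may also show the thread setting when a model is loaded.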

@mrdg-sys commented on GitHub (Apr 4, 2025):

Here is the download link for the 2.51-bit quant on ollama:
https://ollama.com/Huzderu/deepseek-r1-671b-2.51bit

Or simply run from a command prompt:
ollama pull Huzderu/deepseek-r1-671b-2.51bit

@fanlessfan commented on GitHub (Apr 4, 2025):

Thank you @mrdg-sys

@fanlessfan commented on GitHub (Apr 4, 2025):

I tried the 671b-2.51bit model and got the result below. It's still faster with NUMA enabled, and it never reaches 3 t/s. Maybe the extra memory isn't a benefit here, or the motherboard makes the difference.

![Image](https://github.com/user-attachments/assets/4f0f100d-163b-4c9b-8ab2-5d85b1ec1347)

@mrdg-sys commented on GitHub (Apr 4, 2025):

Is your memory running at 2666MHz?

@mrdg-sys commented on GitHub (Apr 5, 2025):

@fanlessfan

Here are my BIOS settings and the result from the 2.51-bit quant with the prompt "why is the sky blue?":

![Image](https://github.com/user-attachments/assets/db852595-7cea-487d-bed8-eca6d8f5f179)

![Image](https://github.com/user-attachments/assets/434e451b-6e4e-4d94-a559-609dafbeefc0)

![Image](https://github.com/user-attachments/assets/fcd759c6-5444-494e-a80c-23c97662a63a)

![Image](https://github.com/user-attachments/assets/731d40f9-0dd6-4199-ae46-66f617e913eb)

![Image](https://github.com/user-attachments/assets/df1d3baa-2956-43da-8ecc-646f66a64271)

![Image](https://github.com/user-attachments/assets/c0c782f1-6433-4435-9f30-00e8cfc766c2)

Perhaps it's time you consider Windows 11 as the best inference platform for LLMs!

@fanlessfan commented on GitHub (Apr 5, 2025):

Hi @mrdg-sys,

My memory is 2666MHz.

I tried Windows 11 and it behaves differently from Linux with respect to NUMA: disabling NUMA makes ollama faster there, but still not as fast as Linux with NUMA enabled.

I think it's hard to compare because there are so many factors: the Ollama build for each platform, the OS, the motherboard, and even the memory capacity might all affect this.

Thank you so much for spending time with me on this.

On Windows 11:

    NUMA enabled
    total duration:       19m6.6652296s
    load duration:        6m37.2427768s
    prompt eval count:    9 token(s)
    prompt eval duration: 31.786227s
    prompt eval rate:     0.28 tokens/s
    eval count:           1054 token(s)
    eval duration:        11m57.6294909s
    eval rate:            1.47 tokens/s

    NUMA disabled
    total duration:       12m17.4283877s
    load duration:        6m5.0180974s
    prompt eval count:    9 token(s)
    prompt eval duration: 2.10005s
    prompt eval rate:     4.29 tokens/s
    eval count:           843 token(s)
    eval duration:        6m10.3025057s
    eval rate:            2.28 tokens/s

On Linux:

    No NUMA
    prompt eval rate: 20.156 tokens/s
    eval rate:        2.147 tokens/s

    NUMA enabled
    prompt eval rate: 20.441 tokens/s
    eval rate:        2.419 tokens/s
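
For anyone reproducing these figures, they appear to come from ollama's verbose timing output, e.g.:

```sh
ollama run --verbose Huzderu/deepseek-r1-671b-2.51bit "why is the sky blue?"
```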

@fanlessfan commented on GitHub (Apr 5, 2025):

By the way, below is my memory module info; your memory might be faster than mine. You can use the link below to check your memory; it has a compiled Windows version.

https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html

         description: DIMM DDR4 Synchronous LRDIMM 2666 MHz (0.4 ns)
         product: ST57A8G4UNC26R4-SC
         vendor: Samsung
         physical id: 0
         slot: P1-DIMMA1
         size: 64GiB
         width: 64 bits
         clock: 2666MHz (0.4ns)
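
The block above looks like `lshw` output; if so, the same details can be pulled with either of these (assuming the packages are installed):

```sh
sudo lshw -C memory        # per-bank description, part number, slot, size, clock
sudo dmidecode -t memory   # alternative view from the SMBIOS tables
```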

@mrdg-sys commented on GitHub (Apr 5, 2025):

Ollama does have some NUMA support on Linux, which explains your inference boost with NUMA enabled. The Windows situation is the opposite because there is no such support there at the moment. Even so, my Windows inference results are better than my Linux ones.

Perhaps it all comes down to our hardware...

@mrdg-sys commented on GitHub (Apr 5, 2025):

> By the way, below is my memory module info; your memory might be faster than mine. You can use the link below to check your memory; it has a compiled Windows version.
>
> https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html
>
>          description: DIMM DDR4 Synchronous LRDIMM 2666 MHz (0.4 ns)
>          product: ST57A8G4UNC26R4-SC
>          vendor: Samsung
>          physical id: 0
>          slot: P1-DIMMA1
>          size: 64GiB
>          width: 64 bits
>          clock: 2666MHz (0.4ns)

I'll give it a try next week when I'm back in the office.

Reference: github-starred/ollama#68634