[GH-ISSUE #10030] Deepseek R1, 671b is faster than 70b #68634

Open
opened 2026-05-04 14:39:48 -05:00 by GiteaMirror · 30 comments

Originally created by @fanlessfan on GitHub (Mar 28, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10030

Hello,

I tried running deepseek-r1 CPU-only on a dual Xeon 6138 system with 768GB of memory. The result: 671b (1.74 t/s) is faster than 70b (1.53 t/s), even though the 671b model takes longer overall. I also tried the r1-1776 671b-q8 (713GB) model, which runs at 1.29 t/s, not that much slower.

Could anyone explain it?

thx

![Image](https://github.com/user-attachments/assets/29a58daa-a829-4810-bd9c-968781e1a1eb)

![Image](https://github.com/user-attachments/assets/93f9001c-6f46-488d-adb1-f52b0a8614cb)

![Image](https://github.com/user-attachments/assets/653c150b-b438-4588-8754-2fa69ec15268)

GiteaMirror added the performance label 2026-05-04 14:39:48 -05:00

@rick-github commented on GitHub (Mar 28, 2025):

Multi CPU/NUMA systems can suffer from thread contention (#2936, #8074, #10022). Try reducing the [thread count](https://github.com/ollama/ollama/issues/9857#issuecomment-2758868781) and see if performance changes.
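
A quick way to experiment with this (a hedged sketch; the value 20, one thread per physical core of a single Xeon 6138 socket, is only an illustration) is to pass `num_thread` as a request option:

```sh
# Override the runner's thread count for a single request and compare eval rates.
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:671b",
  "prompt": "Why is the sky blue?",
  "options": { "num_thread": 20 }
}'
```

The same knob is available interactively via `/set parameter num_thread 20` inside `ollama run`.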

@fanlessfan commented on GitHub (Mar 28, 2025):

I think something else is affecting the result. The 70b model is the distilled Llama 70b, and plain Llama 70b runs at the same speed for me as the DeepSeek-distilled 70b, but the 671b model is DeepSeek itself, and it is faster. Also, the DeepSeek 713GB model should be around half the speed of the DeepSeek 404GB model, yet it is only about 25% slower (1.29/1.74 ≈ 75%).

Can anyone explain this?

@fighter3005 commented on GitHub (Mar 28, 2025):

I don't know about Q4 vs Q8, but the 70B should be slower than the 671B, since the 671B only has about 38B active parameters. The 671B obviously takes much longer to load. Maybe it took longer overall because it used more tokens for thinking?

When I test llama3.2-vision 11B Q4 vs Q8 I also see roughly a 25% uplift. I don't know whether that is normal, though. (RTX 3090)

@fanlessfan commented on GitHub (Mar 28, 2025):

Is there any benefit to the 70b over the 671b other than loading faster and using less memory?

@fighter3005 commented on GitHub (Mar 29, 2025):

Again, not an expert, but since the 671B model is huge, it does not fit on many systems. In that case the 70B would be beneficial, especially if you don't have a few hundred GB of VRAM lying around. But since you can fit the 671B in memory, I guess there is no reason to use the 70B model, other than running multiple instances or maybe a longer context, etc.

@NGC13009 commented on GitHub (Mar 29, 2025):

The 70b is a distilled Llama, so the GPU has to genuinely perform the full 70b worth of computation.
The 671b is DeepSeek's MoE architecture: only about 37b parameters are activated per token during inference. So although the model is huge, the per-token compute is roughly that of a 37b model, and you should compare it against a dense model of about that size, such as qwen:32b; the speeds will be in the same ballpark.

@navr32 commented on GitHub (Mar 29, 2025):

Do you have AVX512 enabled?

@fanlessfan commented on GitHub (Mar 29, 2025):

@navr32 I just installed the standard ollama package on Ubuntu Server. Is there a command I can use to check for AVX512?

@rick-github commented on GitHub (Mar 29, 2025):

    lscpu | grep avx512

@fanlessfan commented on GitHub (Mar 29, 2025):

@navr32 Does the output below mean AVX512 is enabled? Thanks.

    $ lscpu | grep avx512
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke md_clear flush_l1d arch_capabilities
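
For what it's worth, the presence of `avx512f` (plus `avx512dq`, `avx512cd`, `avx512bw`, `avx512vl`) in that flag list means AVX-512 is supported and enabled. A quicker filter, assuming a standard Linux `/proc` layout:

```sh
# Print only the AVX-512 feature flags the kernel reports, deduplicated.
grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u
```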

@fanlessfan commented on GitHub (Mar 29, 2025):

@NGC13009 How do you know that about 37b parameters are activated per token during inference? qwen (qwq:32b-q4_K_M) gives 2.83 t/s, and r1-1776:671b-q4_K_M gives 1.8 t/s.

@mrdg-sys commented on GitHub (Apr 3, 2025):

Hi,

I have some insight regarding your dual CPU system, and a few tricks to speed up CPU-only inference.
My system is very similar to yours: dual 6138 Xeons with AVX512 and 385GB of RAM, no GPU. My RAM configuration is chosen for maximum memory throughput: one DIMM per memory channel, 12 DIMMs in total (2 CPUs × 6 memory channels per CPU).

With this configuration I get an average of 3 tokens/second for CPU-only inference. This is the best result we can achieve without NUMA support in Ollama; if that support is added later, perhaps we can reach 5 t/s with our dual CPU systems.

I did try another memory configuration where I filled every available DIMM slot on the motherboard, doubling my total RAM, but it resulted in slower inference (down to 2 t/s) because the RAM speed drops from 2666MHz to 2133MHz when all slots are populated.
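
A rough sketch of why DIMM population matters here (assuming 6 channels per socket and 8 bytes per transfer; these are theoretical peaks, sustained bandwidth is lower):

```sh
# Theoretical per-socket peak bandwidth: channels x transfer rate x 8 bytes.
awk 'BEGIN {
  printf "2666 MT/s (1 DIMM per channel):  %.0f GB/s per socket\n", 6 * 2666 * 8 / 1000
  printf "2133 MT/s (2 DIMMs per channel): %.0f GB/s per socket\n", 6 * 2133 * 8 / 1000
}'
```

Since CPU token generation is essentially memory-bandwidth bound, that ~20% drop in peak bandwidth is consistent with the slower inference seen with all slots populated.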

@fanlessfan commented on GitHub (Apr 4, 2025):

Hi @mrdg-sys,

I think we have exactly the same config, except I have 64GB × 12 = 768GB of RAM, on an X11DPH-T motherboard. How did you get 3 tokens/s?

Here is my memory bandwidth (Intel MLC output):

    Intel(R) Memory Latency Checker - v3.11b
    *** Unable to modify prefetchers (try executing 'modprobe msr')
    *** So, enabling random access for latency measurements
    Measuring idle latencies for random access (in ns)...
                    Numa node
    Numa node            0       1
           0          88.5   142.1
           1         142.2    84.1

    Measuring Peak Injection Memory Bandwidths for the system
    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using all the threads from each core if Hyper-threading is enabled
    Using traffic with the following read-write ratios
    ALL Reads        :  216944.3
    3:1 Reads-Writes :  204116.6
    2:1 Reads-Writes :  203703.3
    1:1 Reads-Writes :  190054.8
    Stream-triad like:  178900.1

    Measuring Memory Bandwidths between nodes within system
    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using all the threads from each core if Hyper-threading is enabled
    Using Read-only traffic type
                    Numa node
    Numa node            0          1
           0        108828.8    51099.7
           1         51098.2   108531.9

    Measuring Loaded Latencies for the system
    Using all the threads from each core if Hyper-threading is enabled
    Using Read-only traffic type
    Inject   Latency  Bandwidth
    Delay    (ns)     MB/sec
    ==========================
     00000   200.80   217166.8
     00002   201.15   217055.5
     00008   201.34   217105.4
     00015   200.94   216880.5
     00050   199.42   214537.3
     00100   142.04   192391.5
     00200   110.08   120579.6
     00300   104.14    83600.2
     00400    97.67    64017.2
     00500    95.84    51905.7
     00700    93.87    37740.0
     01000    92.05    26926.5
     01300    91.15    20951.5
     01700    90.39    16274.2
     02500    89.70    11345.9
     03500    89.33     8325.9
     05000    88.99     6057.5
     09000    88.94     3694.7
     20000    87.68     2068.5

thx
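
As a rough sanity check of these numbers (assumptions: ~37B active parameters per token for the 671b MoE and roughly 0.6 bytes per weight for a q4_K_M quant), decode speed is bounded by how fast the active weights can be streamed from RAM:

```sh
# Crude upper bound on CPU decode speed: memory bandwidth / bytes read per token.
awk 'BEGIN {
  bytes_per_token = 37e9 * 0.6            # ~22 GB of weights touched per token
  printf "at 217 GB/s (measured peak): %.1f t/s\n", 217e9 / bytes_per_token
  printf "at 108 GB/s (single node):   %.1f t/s\n", 108e9 / bytes_per_token
}'
```

Real runs land well below these bounds because of cross-node traffic, latency, and compute overhead, but the arithmetic shows why the MoE 671b can keep pace with, or beat, a dense 70b on the same memory system.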

@mrdg-sys commented on GitHub (Apr 4, 2025):

I think I can already spot your issue; it's in your memory bandwidth result:

    ALL Reads : 216944.3

To achieve that number, your motherboard's BIOS must have NUMA nodes enabled (the default), which maximizes memory bandwidth by splitting the workload between DIMMs and CPU cores... but this configuration actually slows down CPU LLM inference. To improve things, you need to disable all NUMA settings in your motherboard BIOS.

My server is a Fujitsu RX2530 M4 and its BIOS lets me disable NUMA outright. In your case you may need to change NUMA from a 2-way to a 1-way or 0-way configuration if a disable option isn't available.

After reconfiguring the BIOS, run another memory test and look for roughly half the memory bandwidth, something like ALL Reads ~110000.

With that result you can be sure that NUMA is now disabled.

Even though your memory throughput is (in theory) now halved, actual LLM inference will speed up because the CPU cores have unrestricted access to all available memory instead of being clustered per node.

Also, I leave hyperthreading enabled so I can do other tasks on the server while running LLM inference.
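
As a software-level alternative to the BIOS change (a hedged suggestion; it requires the `numactl` package and results vary by workload), Linux can interleave the model's pages across both sockets:

```sh
# Inspect the NUMA topology first: node sizes and inter-node distances.
numactl --hardware

# Restart the server with its allocations interleaved across both nodes
# instead of being placed on the first-touch node.
sudo systemctl stop ollama
numactl --interleave=all ollama serve
```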

@mrdg-sys commented on GitHub (Apr 4, 2025):

> Is there any benefit to the 70b over the 671b other than loading faster and using less memory?

By the way, have you tried the latest qwq 32B Q4 (20GB) model?

I've experimented with many LLMs, large and small, and so far they all give inconsistent results when asked a control question such as:

How many R letters are in the word strawberry?

The answer varies between 2 and 3, which makes them unreliable. It's important to ask control questions after a series of LLM answers to see whether the model is still on track... or has lost its mind.

The latest qwq 32B model has so far always been consistent: the answer is always 3, which is correct.

The reason I bring this up is that this model is small compared to the others and easily fits into GPU memory for fast inference. I see little point in dealing with large 70B and 671B models that are very slow and give inconsistent results when you can work with a quick, tiny one and get correct answers.

Give it a try.

@NGC13009 commented on GitHub (Apr 4, 2025):

> @NGC13009 How do you know that about 37b parameters are activated per token during inference? qwen (qwq:32b-q4_K_M) gives 2.83 t/s, and r1-1776:671b-q4_K_M gives 1.8 t/s.

The DeepSeek paper provides this figure: each expert is about 4B parameters and 8 are activated at a time, giving 32B, plus some computation outside the expert layers, for roughly 37b in total.

That said, real inference won't exactly match the speed of a ~32b model, because the actual speed depends on which GPUs you use and how the deployment is set up (how multi-GPU parallelism is optimized). As a rule of thumb, 20+ t/s on GPUs is normal, and reaching 3 t/s on CPU is already very good. These are just figures from my own experience, for reference only.
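
Mirroring the commenter's figures (these are approximations, not exact values from the model card):

```sh
# 8 activated experts x ~4B parameters each, plus ~5B of attention/shared
# weights that run for every token regardless of expert routing.
awk 'BEGIN { print 8 * 4 + 5, "B parameters active per token (approx.)" }'
```

This is why a ~32-37B dense model, not the full 671B, is the right yardstick for per-token speed.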

@fanlessfan commented on GitHub (Apr 4, 2025):

@NGC13009 Thanks for sharing your experience.

@fanlessfan commented on GitHub (Apr 4, 2025):

Hi @rick-github,

I tried disabling NUMA and performance got slightly worse. My memory bandwidth only dropped to 150GB/s rather than the expected ~100GB/s. Which model did you run to get 3 t/s? With 384GB of memory you can't run the deepseek-r1:671b model.

I tried the qwen 20GB model and got 2.5 t/s with NUMA disabled, versus 2.8 t/s with NUMA enabled.

@mrdg-sys commented on GitHub (Apr 4, 2025):

When you run ollama, leave the inference settings at their defaults and don't force a specific number of threads. By default ollama runs one thread per CPU core, and modifying that thread count will decrease performance.

Due to my RAM limitation, my 671B model of choice is the 2.51-bit quant, only 220GB.

@mrdg-sys commented on GitHub (Apr 4, 2025):

" I tried qwen 20GB and got 2.5t/s. I got 2.8t/s without NUMA disable. "

Yes, thats normal because any small model that fits into single dimm module will benefit from NUMA enabled, however large models that span across many dimm modules do better with NUMA disabled.

NUMA = Non Uniform Memory Access (enabled)
UMA = Uniform Memory Access (disabled)

@fanlessfan commented on GitHub (Apr 4, 2025):

For 671b-q4 I got 1.95 t/s with NUMA disabled and 2.24 t/s with it enabled. I also monitored real-time memory bandwidth with Intel PCM: it was around 80GB/s with NUMA enabled and dropped to 40GB/s with NUMA disabled.

Did you make the 2.51 quant yourself? Ollama only provides the 671b (404GB) model.

Yes, I leave the ollama settings at their defaults. I did try different thread counts on my i7-13700K (8 P-cores, 8 E-cores, 24 threads) and found that 16 threads performs better than the default. I didn't try the 671b model on that machine since it only has 96GB of RAM. Is there any way to verify the thread count while ollama is running?
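
On the last question, one hedged way to check on a Linux host (the model runner is a child process of the same `ollama` binary, so its thread count shows up under NLWP):

```sh
# NLWP = number of threads per process; the runner should show roughly
# num_thread worker threads plus a few housekeeping threads.
ps -o pid,nlwp,args -C ollama
```

The runner's startup line in the server log (`journalctl -u ollama`) may also show the thread setting when a model is loaded.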

@mrdg-sys commented on GitHub (Apr 4, 2025):

Here is the download link for the 2.51-bit quant on ollama:
https://ollama.com/Huzderu/deepseek-r1-671b-2.51bit

Or simply run from a command prompt:
ollama pull Huzderu/deepseek-r1-671b-2.51bit

@fanlessfan commented on GitHub (Apr 4, 2025):

Thank you @mrdg-sys

@fanlessfan commented on GitHub (Apr 4, 2025):

I tried the 671b-2.51bit model and got the result below. It's still faster with NUMA enabled, and it never reaches 3 t/s. Maybe the extra memory isn't a benefit here, or the motherboard makes the difference.

![Image](https://github.com/user-attachments/assets/4f0f100d-163b-4c9b-8ab2-5d85b1ec1347)

@mrdg-sys commented on GitHub (Apr 4, 2025):

Is your memory running at 2666MHz?

@mrdg-sys commented on GitHub (Apr 5, 2025):

@fanlessfan

Here are my BIOS settings and the result from the 2.51-bit quant with the prompt "why is the sky blue?":

![Image](https://github.com/user-attachments/assets/db852595-7cea-487d-bed8-eca6d8f5f179)

![Image](https://github.com/user-attachments/assets/434e451b-6e4e-4d94-a559-609dafbeefc0)

![Image](https://github.com/user-attachments/assets/fcd759c6-5444-494e-a80c-23c97662a63a)

![Image](https://github.com/user-attachments/assets/731d40f9-0dd6-4199-ae46-66f617e913eb)

![Image](https://github.com/user-attachments/assets/df1d3baa-2956-43da-8ecc-646f66a64271)

![Image](https://github.com/user-attachments/assets/c0c782f1-6433-4435-9f30-00e8cfc766c2)

Perhaps it's time you consider Windows 11 as the best inference platform for LLMs!

@fanlessfan commented on GitHub (Apr 5, 2025):

Hi @mrdg-sys,

My memory is 2666MHz.

I tried Windows 11 and it behaves differently from Linux with respect to NUMA: disabling NUMA makes ollama faster there, but still not as fast as Linux with NUMA enabled.

I think it's hard to compare because there are so many factors: the Ollama build for each platform, the OS, the motherboard, and even the memory capacity might all affect this.

Thank you so much for spending time with me on this.

On Windows 11:

    NUMA enabled
    total duration:       19m6.6652296s
    load duration:        6m37.2427768s
    prompt eval count:    9 token(s)
    prompt eval duration: 31.786227s
    prompt eval rate:     0.28 tokens/s
    eval count:           1054 token(s)
    eval duration:        11m57.6294909s
    eval rate:            1.47 tokens/s

    NUMA disabled
    total duration:       12m17.4283877s
    load duration:        6m5.0180974s
    prompt eval count:    9 token(s)
    prompt eval duration: 2.10005s
    prompt eval rate:     4.29 tokens/s
    eval count:           843 token(s)
    eval duration:        6m10.3025057s
    eval rate:            2.28 tokens/s

On Linux:

    No NUMA
    prompt eval rate: 20.156 tokens/s
    eval rate:        2.147 tokens/s

    NUMA enabled
    prompt eval rate: 20.441 tokens/s
    eval rate:        2.419 tokens/s
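
For anyone reproducing these figures, they appear to come from ollama's verbose timing output, e.g.:

```sh
ollama run --verbose Huzderu/deepseek-r1-671b-2.51bit "why is the sky blue?"
```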

@fanlessfan commented on GitHub (Apr 5, 2025):

By the way, below is my memory module info; your memory might be faster than mine. You can use the link below to check your memory; it has a compiled Windows version.

https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html

         description: DIMM DDR4 Synchronous LRDIMM 2666 MHz (0.4 ns)
         product: ST57A8G4UNC26R4-SC
         vendor: Samsung
         physical id: 0
         slot: P1-DIMMA1
         size: 64GiB
         width: 64 bits
         clock: 2666MHz (0.4ns)
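
The block above looks like `lshw` output; if so, the same details can be pulled with either of these (assuming the packages are installed):

```sh
sudo lshw -C memory        # per-bank description, part number, slot, size, clock
sudo dmidecode -t memory   # alternative view from the SMBIOS tables
```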

@mrdg-sys commented on GitHub (Apr 5, 2025):

Ollama does have some NUMA support on Linux, which explains your inference boost with NUMA enabled. The Windows situation is the opposite because there is no such support there at the moment. Even so, my Windows inference results are better than my Linux ones.

Perhaps it all comes down to our hardware...

@mrdg-sys commented on GitHub (Apr 5, 2025):

> By the way, below is my memory module info; your memory might be faster than mine. You can use the link below to check your memory; it has a compiled Windows version.
>
> https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html
>
>          description: DIMM DDR4 Synchronous LRDIMM 2666 MHz (0.4 ns)
>          product: ST57A8G4UNC26R4-SC
>          vendor: Samsung
>          physical id: 0
>          slot: P1-DIMMA1
>          size: 64GiB
>          width: 64 bits
>          clock: 2666MHz (0.4ns)

I'll give it a try next week when I'm back in the office.

Reference: github-starred/ollama#68634