[GH-ISSUE #5745] Upgraded hardware no change in performance #29338

Closed
opened 2026-04-22 08:06:32 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @AncientMystic on GitHub (Jul 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5745

Wondering if this is a bug in ollama. I just upgraded my core hardware and was hoping to see at least a slight improvement in tokens/s.

From:

  • i7-6700 (AVX2, 8k benchmark score)
  • 64 GB DDR4 2133 MHz (17 GB/s)

To:

  • i7-7820X (8c/16t, AVX-512, 17k benchmark score)
  • 96 GB DDR4 quad-channel 2600 MHz (56 GB/s)

Everything about the new hardware is at least 2x faster, plus AVX-512, which should be able to deliver up to 5-6x the performance.

But testing models, I seem to be getting exactly the same tokens/s performance: literally no change at all, not even 1 token/s higher. Many of the larger models are still 0.9-2 t/s, around 5 for the slightly smaller ones, and 24 t/s (at most) for the GPU-accelerated models that fit in VRAM.

It seems unexpected to see no change at all from that big a performance leap. The first system was also running in a VM that the motherboard wasn't handling properly, with no resizable BAR or any of the other performance-improving features the new system has; that alone should be worth a 10-20% improvement.

Is there anything I can add to the environment variables, or something similar, to maybe fix this?
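For context on why the upgrade might not move tokens/s: CPU decode is largely memory-bandwidth bound, since each generated token streams the full set of weights through RAM. A rough back-of-envelope sketch of the ceiling implied by the two bandwidth figures above (the 8 GB model size here is an assumption for illustration, not from the report):

```python
def max_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Theoretical ceiling on decode rate: each token reads all weights once.
    Real throughput is lower (cache effects, compute, threading overhead)."""
    return bandwidth_gb_s / model_gb

# Hypothetical 8 GB quantized model on each system's quoted bandwidth:
old_ceiling = max_tokens_per_s(17, 8)   # i7-6700 system  -> ~2.1 t/s
new_ceiling = max_tokens_per_s(56, 8)   # i7-7820X system -> ~7.0 t/s
```

If the new machine's effective bandwidth were really 56 GB/s, the ceiling should roughly triple, so identical tokens/s suggests the memory isn't actually running in quad channel, or something else (VM, NUMA, thread count) is the bottleneck.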


@rick-github commented on GitHub (Jul 17, 2024):

More information would be needed to diagnose the issue. Model? Version of ollama? GPU type? Output of `nvidia-smi` or other monitoring? CPU load during inference? Output of `vmstat` or `iostat` or other system statistics?
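A hypothetical helper for gathering those diagnostics in one pass, using the tool names from the comment above; anything not installed on the machine is simply skipped:

```python
import shutil
import subprocess

# Commands requested above; adjust or extend as needed.
CMDS = [
    ["ollama", "--version"],
    ["nvidia-smi"],          # GPU type and VRAM use during inference
    ["vmstat", "1", "2"],    # CPU load / memory pressure
    ["iostat"],              # disk statistics
]

def collect() -> dict:
    """Run each available command and capture its stdout for the bug report."""
    out = {}
    for cmd in CMDS:
        if shutil.which(cmd[0]):  # skip tools that aren't installed
            out[cmd[0]] = subprocess.run(
                cmd, capture_output=True, text=True
            ).stdout
    return out
```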


@dhiltgen commented on GitHub (Jul 23, 2024):

I believe those two CPUs are from the same generation, and their clock speeds are ~close. You mention benchmarks but not which ones, so I'm guessing they're not specifically vector math performance comparisons.

> which should be able to deliver up to 5-6x the performance

I'm not sure where you got that data. In my testing, AVX512 didn't outperform AVX2, which is why we haven't compiled a specific runner for that vector feature. We might offer an AVX512 runner in the future; this is tracked via #2205.

Where you'll really see big performance improvements is using a GPU. I'm not aware of any silver bullet to get significant CPU performance benefits, and even over multiple generations of Intel (or AMD) CPUs, I only see incremental changes in performance.

Some examples with llama3 and the `cpu_avx2` runner:

  • 13th gen i9-13900K: 12.39 tokens/s
  • 13th gen i7-13700KF: 9.27 tokens/s
  • 12th gen i5-12400F: 11.53 tokens/s
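To compare machines on equal footing, the decode rate can be computed from the `eval_count` and `eval_duration` fields that ollama's `/api/generate` endpoint returns (durations are reported in nanoseconds). A minimal sketch, assuming a local server on the default port and that the model named is already pulled:

```python
import json
import urllib.request

def eval_rate(resp: dict) -> float:
    """tokens/s from an ollama /api/generate response;
    eval_duration is in nanoseconds."""
    return resp["eval_count"] / resp["eval_duration"] * 1e9

def benchmark(model: str, prompt: str = "Why is the sky blue?") -> float:
    """Run one non-streaming generation and return its decode rate."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(
            {"model": model, "prompt": prompt, "stream": False}
        ).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return eval_rate(json.load(r))
```

Running the same `benchmark("llama3")` call on both systems removes prompt and sampling variation from the comparison.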

Reference: github-starred/ollama#29338