[GH-ISSUE #10592] AVX-512 Vector Neural Network Instructions? #53482

Open
opened 2026-04-29 03:21:09 -05:00 by GiteaMirror · 5 comments

Originally created by @alsavu on GitHub (May 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10592

Hello, are there any plans for future releases to also use the AVX-512 VNNI CPU instruction set?

I currently run the following config: a Lenovo ThinkStation P920 with 2x Intel Xeon Platinum 8280, 1 TB DDR4 RAM, and 4x NVIDIA RTX A4000 GPUs. Some models spill over from VRAM into system RAM, which then falls back to the CPU; that works, but it is significantly slower (2-5 tokens/s).

Checking the server.log file, I can see the following:

load_backend: loaded CPU backend from C:\Users\Administrator\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Administrator\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-06T18:10:09.448+02:00 level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1
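For anyone else reading their server.log, the feature flags in the `msg=system` line can be pulled out mechanically. A hedged Python sketch (the `CPU.<n>.<FEATURE>=<0|1>` token format follows the log excerpt above; the helper name and everything else here is illustrative, not part of ollama):

```python
import re

def parse_system_flags(log_line: str) -> dict[str, bool]:
    """Return {feature: enabled} for every CPU.<n>.<FEATURE>=<0|1> token.

    A feature counts as enabled if any CPU socket reports it as 1.
    """
    flags: dict[str, bool] = {}
    for m in re.finditer(r"CPU\.\d+\.([A-Z0-9_]+)=([01])", log_line):
        feature, value = m.group(1), m.group(2)
        flags[feature] = flags.get(feature, False) or value == "1"
    return flags

# Abbreviated version of the skylakex log line quoted above.
line = ("msg=system CPU.0.SSE3=1 CPU.0.AVX2=1 CPU.0.AVX512=1 "
        "CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1")
flags = parse_system_flags(line)
print("AVX512_VNNI" in flags)  # False: the skylakex backend never reports it
```

Absence of a flag (as opposed to `=0`) means the loaded backend was not built with it at all, which is exactly what the skylakex line shows for AVX512_VNNI.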

Initially I noticed that the flags stop at AVX512=1. Then I saw that the CPU profile loaded is called Skylake-X, which corresponds to the 1st Generation of Intel Xeon Scalable Processors, and those do not have the AVX-512 VNNI instruction set.

Is there any support planned for Cascade Lake-SP (2nd Generation Intel Xeon Scalable Processors), or for the AVX-512 VNNI instruction set in general?

Just asking...

Thank you very much for everything you are doing, it helped me learn a lot.

Regards,
Alexandru

GiteaMirror added the feature request label 2026-04-29 03:21:09 -05:00

@rick-github commented on GitHub (May 7, 2025):

AVX-512 VNNI is supported on my 11th Gen Intel(R) Core(TM) i7-11800H by loading libggml-cpu-icelake.so.

time=2025-05-06T22:56:49.993Z level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1
 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1
 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1
 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1
 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)

@alsavu commented on GitHub (May 7, 2025):

So your 11th Gen mobile CPU (codename "Tiger Lake"; the desktop counterpart is "Rocket Lake") is treated as "Ice Lake", the codename for the 10th Gen mobile CPUs, while my 2nd Gen Xeon Scalable CPU, codenamed "Cascade Lake-SP", is treated as "Skylake-X", the codename for 7th Gen desktop Intel CPUs.

Admittedly, this is not an apples-to-apples comparison, since desktop/mobile CPU generations should not really be compared with server CPU generations...

I would love to know if there is any way I can have Ollama use the full power and potential of the Cascade Lake-SP (2nd Generation Intel Xeon Scalable) Intel Xeon Platinum 8280 CPU.

If anyone has any ideas how I can achieve this on Windows for now, I would be grateful.

Thank you.

Regards,
Alexandru


@rick-github commented on GitHub (May 7, 2025):

ollama decides which backend to use based on a feature count; the highest score wins. You could try renaming ggml-cpu-skylakex.dll to ggml-cpu-skylakex.dll.old and see if the next backend meets your needs.


@alsavu commented on GitHub (May 7, 2025):

I already tried that, and it falls back to the Haswell DLL; judging by the logs and Intel's documentation, that targets an even older CPU generation, without AVX-512, let alone the VNNI instructions.

I tried renaming them all apart from Icelake, but it does not choose that one.


@volkertb commented on GitHub (Mar 19, 2026):

I hope that a maintainer can review and approve PR #14436 soon. Is there anyone in particular whose attention we should bring this to? It's a trivial one-line fix.


Reference: github-starred/ollama#53482