[GH-ISSUE #6312] how to force ollama to use different cpu runners / how to compile windows avx512 runner? #65996

Closed
opened 2026-05-03 23:29:10 -05:00 by GiteaMirror · 17 comments
Owner

Originally created by @AncientMystic on GitHub (Aug 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6312

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

According to the logs, ollama seems to only be using AVX, not AVX2. How would I fix this and force AVX2 or higher?

I am also wondering how I compile the AVX512 runner for Windows. I have compiled other runners and the CUDA runner fine, but regardless of what I try to set, it just generates cpu, avx, and avx2 and stops.

Running `$env:OLLAMA_CUSTOM_CPU_DEFS="-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on"` (or any other combination of this command) seems to have no effect on generating a different runner beyond the defaults. Is there a file I need to edit to change what is compiled? I would also like to skip cpu, avx, and avx2 if possible, as they take a fair amount of time and are already compiled.

The CPU is an i7-7820X with support for AVX2 & AVX-512.
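As a quick way to confirm which AVX variants a CPU actually advertises, here is a small sketch; it is Linux-only (it reads `/proc/cpuinfo`), and flag names follow the kernel's spelling (`avx512f` etc.), so on Windows you would check CPU-Z or the vendor spec sheet instead.

```python
def avx_flags(cpuinfo_text):
    """Return the AVX-family flags found in a /proc/cpuinfo 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return {f for f in flags if f.startswith("avx")}
    return set()

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(sorted(avx_flags(f.read())))
    except FileNotFoundError:
        print("no /proc/cpuinfo on this platform")
```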

OS

Windows

GPU

Nvidia, Intel

CPU

Intel

Ollama version

0.3.4

GiteaMirror added the performance, build, nvidia, feature request labels 2026-05-03 23:29:14 -05:00

@rick-github commented on GitHub (Aug 11, 2024):

You can force ollama to use specific runners by setting `OLLAMA_LLM_LIBRARY` in the server environment, e.g. `OLLAMA_LLM_LIBRARY=cpu_avx2`.


@AncientMystic commented on GitHub (Aug 11, 2024):

> You can force ollama to use specific runners by setting `OLLAMA_LLM_LIBRARY` in the server environment, e.g. `OLLAMA_LLM_LIBRARY=cpu_avx2`.

Thank you, that works. One issue though: it then stops using CUDA. Can ollama not use AVX2 in combination with CUDA?


@rick-github commented on GitHub (Aug 11, 2024):

Not sure I understand the question. ollama starts a runner per model, and the available hardware normally dictates which runner is used: if CUDA is available, the cuda runner is used; if only CPU is available, then a CPU runner that supports the appropriate instruction feature set of the CPU is used. If you override ollama's ability to choose the runner, then only the runner you configure will be used.
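The selection order described above can be sketched as follows. This is a hedged illustration in Python (not ollama's actual Go code): the `OLLAMA_LLM_LIBRARY` override wins, then CUDA, then the best CPU runner the host supports.

```python
def pick_runner(has_cuda, cpu_flags, override=None):
    """Pick a runner name, mimicking the order described in the comment above."""
    if override:                      # e.g. OLLAMA_LLM_LIBRARY=cpu_avx2
        return override
    if has_cuda:                      # GPU available: use the cuda runner
        return "cuda"
    # Otherwise fall through to the most capable CPU runner built by default.
    for flag, runner in (("avx2", "cpu_avx2"), ("avx", "cpu_avx")):
        if flag in cpu_flags:
            return runner
    return "cpu"

print(pick_runner(True, {"avx", "avx2"}))   # cuda
print(pick_runner(False, {"avx", "avx2"}))  # cpu_avx2
```

This also illustrates the behavior reported earlier in the thread: once the override is set, it wins even when CUDA is available.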


@AncientMystic commented on GitHub (Aug 11, 2024):

> Not sure I understand the question. ollama starts a runner per model, and the available hardware normally dictates which runner is used: if CUDA is available, the cuda runner is used; if only CPU is available, then a CPU runner that supports the appropriate instruction feature set of the CPU is used. If you override ollama's ability to choose the runner, then only the runner you configure will be used.

I was under the impression both were used, as ollama uses the CPU even while CUDA is in use and reports which AVX type is enabled, so I thought getting AVX2 or higher working in combination with CUDA would give better performance, even if just by a few tokens/s.

Mainly I was hoping for any boost I could get when going over the VRAM amount, as I only have a Tesla P4 8GB and an Intel Arc A310 4GB (which is not used), but I have 96GB of RAM, and models often drop to very low token rates if they exceed the VRAM amount even slightly.

Edit: maybe I am misunderstanding how ollama uses the CPU in parallel while CUDA is enabled. I am just trying to find whatever I can to improve performance even slightly. I was reading through PRs, and #3468 sounds really useful, like it would have a significant impact on CPU performance, much as #6279 does on K/V RAM usage, but #3468 doesn't include edits to compile under Windows and doesn't appear to be updated for the current version of ollama.


@rick-github commented on GitHub (Aug 11, 2024):

I understand: you want to maximize performance when ollama can't offload all layers to the GPU. I did some tests and I see what you mean: when the cuda runner is executed, it only has the AVX feature enabled. Unfortunately there's no way to enable AVX2 at runtime; this would need a change to the build instructions. Perhaps one of the ollama team can weigh in here.


@AncientMystic commented on GitHub (Aug 11, 2024):

> I understand: you want to maximize performance when ollama can't offload all layers to the GPU. I did some tests and I see what you mean: when the cuda runner is executed, it only has the AVX feature enabled. Unfortunately there's no way to enable AVX2 at runtime; this would need a change to the build instructions. Perhaps one of the ollama team can weigh in here.

Exactly, that is what I am talking about. I am not even sure it will have a significant impact, but I wanted to try it to make sure I am at least making a good attempt at using everything to its highest current potential.


@rick-github commented on GitHub (Aug 11, 2024):

I did some quick tests and tokens per second increased by 14% from AVX to AVX2, so enabling other CPU features for the CUDA build seems like a good idea.


@AncientMystic commented on GitHub (Aug 11, 2024):

> I did some quick tests and tokens per second increased by 14% from AVX to AVX2, so enabling other CPU features for the CUDA build seems like a good idea.

That definitely sounds worth it. I know I see a slight increase on CPU only. I would like to get AVX512 working too, but even just AVX2 would be good.


@rick-github commented on GitHub (Aug 11, 2024):

I built a version of the CUDA driver with AVX2 and did a test against stock 0.3.4. Model qwen2:0.5b, prompt "why is the sky blue?", RTX4070.

- baseline CPU performance in both versions: 93 tokens per second (cpu_avx2 runner)
- baseline GPU performance in both versions: 287 tps (cuda runner)
- 1 of 25 layers in GPU: 0.3.4 = 83 tps, 0.3.4+avx2 = 91 tps, 9.6% improvement
- 12 of 25 layers in GPU: 0.3.4 = 100 tps, 0.3.4+avx2 = 108 tps, 8% improvement
- 24 of 25 layers in GPU: 0.3.4 = 142 tps, 0.3.4+avx2 = 146 tps, 2.8% improvement
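The percentages above can be checked directly from the raw tokens/s numbers; a quick sketch:

```python
# Recompute the quoted improvements from the stock vs. +avx2 tokens/s figures.
results = {
    "1 of 25 layers":  (83, 91),
    "12 of 25 layers": (100, 108),
    "24 of 25 layers": (142, 146),
}
for label, (stock, avx2) in results.items():
    gain = (avx2 - stock) / stock * 100
    print(f"{label}: {gain:.1f}% improvement")
# → 9.6%, 8.0%, 2.8%
```

As expected, the relative gain from the AVX2-enabled CUDA runner shrinks as more layers fit on the GPU, since less of the work runs on the CPU.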


@AncientMystic commented on GitHub (Aug 12, 2024):

> I built a version of the CUDA driver with AVX2 and did a test against stock 0.3.4. Model qwen2:0.5b, prompt "why is the sky blue?", RTX4070.
>
> - baseline CPU performance in both versions: 93 tokens per second (cpu_avx2 runner)
> - baseline GPU performance in both versions: 287 tps (cuda runner)
> - 1 of 25 layers in GPU: 0.3.4 = 83 tps, 0.3.4+avx2 = 91 tps, 9.6% improvement
> - 12 of 25 layers in GPU: 0.3.4 = 100 tps, 0.3.4+avx2 = 108 tps, 8% improvement
> - 24 of 25 layers in GPU: 0.3.4 = 142 tps, 0.3.4+avx2 = 146 tps, 2.8% improvement

That seems like a pretty decent difference. What file(s) would I edit to compile a CUDA version with AVX2? I'd also like to give AVX512 a try here.


@rick-github commented on GitHub (Aug 12, 2024):

On a linux system using docker:

```diff
--- a/Dockerfile
+++ b/Dockerfile
@@ -18,7 +18,7 @@ ENV PATH /opt/rh/devtoolset-10/root/usr/bin:$PATH
 COPY --from=llm-code / /go/src/github.com/ollama/ollama/
 WORKDIR /go/src/github.com/ollama/ollama/llm/generate
 ARG CGO_CFLAGS
-RUN OLLAMA_SKIP_STATIC_GENERATE=1 OLLAMA_SKIP_CPU_GENERATE=1 sh gen_linux.sh
+RUN OLLAMA_SKIP_STATIC_GENERATE=1 OLLAMA_SKIP_CPU_GENERATE=1 OLLAMA_CUSTOM_CUDA_DEFS="-DGGML_AVX2=on -DGGML_FMA=on -DGGML_F16C=on" sh gen_linux.sh
 
 FROM --platform=linux/arm64 nvidia/cuda:$CUDA_VERSION-devel-rockylinux8 AS cuda-build-arm64
 ARG CMAKE_VERSION
```

I think for windows you would edit `llm/generate/gen_windows.ps1` and at line 275 change `-DGGML_AVX2=off` to `-DGGML_AVX512=on`. For avx2, there are additional arguments (`-DGGML_FMA=on -DGGML_F16C=on`); I don't know if you need to include them for avx512.


@AncientMystic commented on GitHub (Aug 12, 2024):

> On a linux system using docker:
>
> ```diff
> --- a/Dockerfile
> +++ b/Dockerfile
> @@ -18,7 +18,7 @@ ENV PATH /opt/rh/devtoolset-10/root/usr/bin:$PATH
>  COPY --from=llm-code / /go/src/github.com/ollama/ollama/
>  WORKDIR /go/src/github.com/ollama/ollama/llm/generate
>  ARG CGO_CFLAGS
> -RUN OLLAMA_SKIP_STATIC_GENERATE=1 OLLAMA_SKIP_CPU_GENERATE=1 sh gen_linux.sh
> +RUN OLLAMA_SKIP_STATIC_GENERATE=1 OLLAMA_SKIP_CPU_GENERATE=1 OLLAMA_CUSTOM_CUDA_DEFS="-DGGML_AVX2=on -DGGML_FMA=on -DGGML_F16C=on" sh gen_linux.sh
>
>  FROM --platform=linux/arm64 nvidia/cuda:$CUDA_VERSION-devel-rockylinux8 AS cuda-build-arm64
>  ARG CMAKE_VERSION
> ```
>
> I think for windows you would edit `llm/generate/gen_windows.ps1` and at line 275 change `-DGGML_AVX2=off` to `-DGGML_AVX512=on`. For avx2, there are additional arguments (`-DGGML_FMA=on -DGGML_F16C=on`); I don't know if you need to include them for avx512.

Awesome, thank you very much. Compiling now; I am giving AVX2 a try, then once I see how that goes I will see if I can make AVX512 work and test the differences.


@AncientMystic commented on GitHub (Aug 12, 2024):

Tested CUDA+AVX2 a bit; there seems to be a slight token increase across the board, especially on larger models going over VRAM.

But it also seems to have another significant impact: I am noticing less of a pause between chunks being generated on larger models.

Before, it would pause for a second between words; now it seems to generate larger chunks, and the pause between them is so short it's almost nonexistent. It was previously up to 1-3 seconds, so this results in a much faster and smoother response from any model. Even the very large models that run at about 1 token/s feel smoother and more usable now.

Still trying to get CPU avx512 & CUDA+avx512 compiled; it does not seem to want to do it on Windows.


@AncientMystic commented on GitHub (Aug 12, 2024):

I have successfully gotten the cpu_avx512 runner and CUDA+AVX512 built and running.

I added at line 259 (now occupying lines 260-274):

```powershell
function build_cpu_avx512() {
    if ((-not "${env:OLLAMA_SKIP_CPU_GENERATE}" ) -and ((-not "${env:OLLAMA_CPU_TARGET}") -or ("${env:OLLAMA_CPU_TARGET}" -eq "cpu_avx512"))) {
        init_vars
        $script:cmakeDefs = $script:commonCpuDefs + @("-A", "x64", "-DGGML_AVX=on", "-DGGML_AVX2=on", "-DGGML_AVX512=on", "-DGGML_FMA=on", "-DGGML_F16C=on") + $script:cmakeDefs
        $script:buildDir="../build/windows/${script:ARCH}/cpu_avx512"
        $script:distDir="$script:DIST_BASE\cpu_avx512"
        write-host "Building AVX512 CPU"
        build
        sign
        install
    } else {
        write-host "Skipping CPU AVX512 generation step as requested"
    }
}
```

and a call to `build_cpu_avx512`, now occupying line 432, directly under `build_cpu_avx2`,

then ran the commands:

```powershell
$env:OLLAMA_CPU_TARGET="cpu_avx512"
go generate ./...
go build .
```

Results seem to be up to a 10-20% token increase on some models (roughly equal to CUDA+AVX2 on others), plus even smoother generation. The pause between chunks now seems extremely small, to the point that if it were just a little lower it would be hard to notice any pausing at all. This is a significant improvement over the standard CUDA+AVX, where the pause at low token rates, with part of the model offloaded to the CPU, is so long that you wait twice as long for the response to finish.

EDIT: I also just noticed CPU usage seems to have dropped dramatically. It is currently hitting 40%, where it was often 90-98% when ollama was generating before with CUDA+AVX, so this will save power too.


@dhiltgen commented on GitHub (Aug 18, 2024):

We've been holding off on avx512 based on our initial performance tests showing a very minimal performance improvement, and the desire to avoid too much sprawl in the permutations of runners we bundle. With #5049 we'll be adding CUDA v12, which will let us enable features that don't exist in v11 for newer GPUs, but there's also potential to refine the CPU flags we use for the v12 runner, since we can still fall back to v11 with AVX to support older (or server) CPUs that lack the vector extensions.

Our intent is to make it straightforward for users to build from source and tune the CPU flags for the GPU runners, but it's still a bit rough around the edges. If you bump into bugs in setting `OLLAMA_CUSTOM_CUDA_DEFS` that prevent this from working, PRs are welcome. Just CC me so I can review them.


@AncientMystic commented on GitHub (Aug 18, 2024):

> We've been holding off on avx512 based on our initial performance tests showing a very minimal performance improvement, and the desire to avoid too much sprawl in the permutations of runners we bundle. With #5049 we'll be adding CUDA v12, which will let us enable features that don't exist in v11 for newer GPUs, but there's also potential to refine the CPU flags we use for the v12 runner, since we can still fall back to v11 with AVX to support older (or server) CPUs that lack the vector extensions.
>
> Our intent is to make it straightforward for users to build from source and tune the CPU flags for the GPU runners, but it's still a bit rough around the edges. If you bump into bugs in setting `OLLAMA_CUSTOM_CUDA_DEFS` that prevent this from working, PRs are welcome. Just CC me so I can review them.

I can definitely understand not adding 512 by default; there isn't a lot of support for it on CPUs. So far the biggest impact I'm noticing is that with CUDA+AVX512 it handles larger models more smoothly and at lower CPU usage.

For example, I was getting 0.8-1.2 t/s with 90-99% CPU usage. I have been testing 8x7b models now (46.7b, 25-30GB in size), and I am seeing 3-3.6 t/s with no pausing and only 40% CPU usage. (With CUDA+AVX, once anything was on the CPU, it seemed to pause longer the larger the model, and the pausing between words made it horrible to use.)

I think CUDA+AVX2 accounts for the majority of the performance boost; AVX512 only seems to add a tiny boost on a few models and overall is about the same, as you said, besides being a little smoother. The biggest improvement I noticed with 512 was the massive drop in CPU usage, which is a big deal.

I have also been using the KV cache PR #6279; it has also helped significantly with running larger models.

I have also been trying to add oneAPI to the mix (for my 4GB Arc GPU) but cannot seem to get it to compile; I keep running into errors relating to kernel32.lib and other libraries.

Are there any other PRs, optional features, or anything that can be enabled/added to increase performance?

I was looking at PR #3468, and the NeuralSpeed backend for Skylake or newer seems to have a 7.27x performance increase, which would be really amazing, but the PR seems to be abandoned. (And it doesn't include Windows.)

(Running an i7-7820X with 96GB RAM + Tesla P4 8GB and Arc A310 4GB)


@dhiltgen commented on GitHub (Apr 9, 2025):

The new cmake-based build now builds multiple CPU-optimized libraries distinct from the GPU libraries, including avx512 support.


Reference: github-starred/ollama#65996