[GH-ISSUE #6312] how to force ollama to use different cpu runners / how to compile windows avx512 runner? #65996

Closed
opened 2026-05-03 23:29:10 -05:00 by GiteaMirror · 17 comments
Owner

Originally created by @AncientMystic on GitHub (Aug 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6312

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

According to the logs, ollama seems to only be using AVX, not AVX2. How would I fix this and force AVX2 or higher?

I am also wondering how I compile the AVX512 runner for Windows. I have compiled other runners and the CUDA runner fine, but regardless of what I try to set, it just generates cpu, avx, and avx2 and stops.

Running `$env:OLLAMA_CUSTOM_CPU_DEFS="-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on"` (or any other combination of this command) seems to have no effect on generating a different runner beyond the defaults. Is there a file I need to edit to change what is compiled? I would also like to skip cpu, avx, and avx2 if possible, as they take a fair amount of time and are already compiled.

The CPU is an i7-7820X with support for AVX2 & AVX-512.
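As a quick way to confirm which AVX variants a CPU actually advertises, here is a small sketch; it is Linux-only (it reads `/proc/cpuinfo`), and flag names follow the kernel's spelling (`avx512f` etc.), so on Windows you would check CPU-Z or the vendor spec sheet instead.

```python
def avx_flags(cpuinfo_text):
    """Return the AVX-family flags found in a /proc/cpuinfo 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return {f for f in flags if f.startswith("avx")}
    return set()

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(sorted(avx_flags(f.read())))
    except FileNotFoundError:
        print("no /proc/cpuinfo on this platform")
```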

OS

Windows

GPU

Nvidia, Intel

CPU

Intel

Ollama version

0.3.4

GiteaMirror added the performance, build, nvidia, feature request labels 2026-05-03 23:29:14 -05:00

@rick-github commented on GitHub (Aug 11, 2024):

You can force ollama to use specific runners by setting `OLLAMA_LLM_LIBRARY` in the server environment, e.g. `OLLAMA_LLM_LIBRARY=cpu_avx2`.


@AncientMystic commented on GitHub (Aug 11, 2024):

> You can force ollama to use specific runners by setting `OLLAMA_LLM_LIBRARY` in the server environment, e.g. `OLLAMA_LLM_LIBRARY=cpu_avx2`.

Thank you, that works. One issue though: it then stops using CUDA. Can ollama not use AVX2 in combination with CUDA?


@rick-github commented on GitHub (Aug 11, 2024):

Not sure I understand the question. ollama starts a runner per model, and the available hardware normally dictates which runner is used: if CUDA is available, the cuda runner is used; if only CPU is available, then a CPU runner that supports the appropriate instruction feature set of the CPU is used. If you override ollama's ability to choose the runner, then only the runner you configure will be used.
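The selection order described above can be sketched as follows. This is a hedged illustration in Python (not ollama's actual Go code): the `OLLAMA_LLM_LIBRARY` override wins, then CUDA, then the best CPU runner the host supports.

```python
def pick_runner(has_cuda, cpu_flags, override=None):
    """Pick a runner name, mimicking the order described in the comment above."""
    if override:                      # e.g. OLLAMA_LLM_LIBRARY=cpu_avx2
        return override
    if has_cuda:                      # GPU available: use the cuda runner
        return "cuda"
    # Otherwise fall through to the most capable CPU runner built by default.
    for flag, runner in (("avx2", "cpu_avx2"), ("avx", "cpu_avx")):
        if flag in cpu_flags:
            return runner
    return "cpu"

print(pick_runner(True, {"avx", "avx2"}))   # cuda
print(pick_runner(False, {"avx", "avx2"}))  # cpu_avx2
```

This also illustrates the behavior reported earlier in the thread: once the override is set, it wins even when CUDA is available.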


@AncientMystic commented on GitHub (Aug 11, 2024):

> Not sure I understand the question. ollama starts a runner per model, and the available hardware normally dictates which runner is used: if CUDA is available, the cuda runner is used; if only CPU is available, then a CPU runner that supports the appropriate instruction feature set of the CPU is used. If you override ollama's ability to choose the runner, then only the runner you configure will be used.

I was under the impression both were used, as ollama uses the CPU even while CUDA is in use and reports which AVX type is enabled, so I thought getting AVX2 or higher working in combination with CUDA would give better performance, even if just by a few tokens/s.

Mainly I was hoping for any boost I could get when going over the VRAM amount, as I only have a Tesla P4 8GB and an Intel Arc A310 4GB (which is not used), but I have 96GB of RAM, and models often drop to very low token rates if they exceed the VRAM amount even slightly.

Edit: maybe I am misunderstanding how ollama uses the CPU in parallel while CUDA is enabled. I am just trying to find whatever I can to improve performance even slightly. I was reading through PRs, and #3468 sounds really useful, like it would have a significant impact on CPU performance, much as #6279 does on K/V RAM usage, but #3468 doesn't include edits to compile under Windows and doesn't appear to be updated for the current version of ollama.


@rick-github commented on GitHub (Aug 11, 2024):

I understand: you want to maximize performance when ollama can't offload all layers to the GPU. I did some tests and I see what you mean: when the cuda runner is executed, it only has the AVX feature enabled. Unfortunately there's no way to enable AVX2 at runtime; this would need a change to the build instructions. Perhaps one of the ollama team can weigh in here.


@AncientMystic commented on GitHub (Aug 11, 2024):

> I understand: you want to maximize performance when ollama can't offload all layers to the GPU. I did some tests and I see what you mean: when the cuda runner is executed, it only has the AVX feature enabled. Unfortunately there's no way to enable AVX2 at runtime; this would need a change to the build instructions. Perhaps one of the ollama team can weigh in here.

Exactly, that is what I am talking about. I am not even sure it will have a significant impact, but I wanted to try it to make sure I am at least making a good attempt at using everything to its highest current potential.


@rick-github commented on GitHub (Aug 11, 2024):

I did some quick tests and tokens per second increased by 14% from AVX to AVX2, so enabling other CPU features for the CUDA build seems like a good idea.


@AncientMystic commented on GitHub (Aug 11, 2024):

> I did some quick tests and tokens per second increased by 14% from AVX to AVX2, so enabling other CPU features for the CUDA build seems like a good idea.

That definitely sounds worth it. I know I see a slight increase on CPU only. I would like to get AVX512 working too, but even just AVX2 would be good.


@rick-github commented on GitHub (Aug 11, 2024):

I built a version of the CUDA driver with AVX2 and did a test against stock 0.3.4. Model qwen2:0.5b, prompt "why is the sky blue?", RTX4070.

- baseline CPU performance in both versions: 93 tokens per second (cpu_avx2 runner)
- baseline GPU performance in both versions: 287 tps (cuda runner)
- 1 of 25 layers in GPU: 0.3.4 = 83 tps, 0.3.4+avx2 = 91 tps, 9.6% improvement
- 12 of 25 layers in GPU: 0.3.4 = 100 tps, 0.3.4+avx2 = 108 tps, 8% improvement
- 24 of 25 layers in GPU: 0.3.4 = 142 tps, 0.3.4+avx2 = 146 tps, 2.8% improvement
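The percentages above can be checked directly from the raw tokens/s numbers; a quick sketch:

```python
# Recompute the quoted improvements from the stock vs. +avx2 tokens/s figures.
results = {
    "1 of 25 layers":  (83, 91),
    "12 of 25 layers": (100, 108),
    "24 of 25 layers": (142, 146),
}
for label, (stock, avx2) in results.items():
    gain = (avx2 - stock) / stock * 100
    print(f"{label}: {gain:.1f}% improvement")
# → 9.6%, 8.0%, 2.8%
```

As expected, the relative gain from the AVX2-enabled CUDA runner shrinks as more layers fit on the GPU, since less of the work runs on the CPU.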


@AncientMystic commented on GitHub (Aug 12, 2024):

> I built a version of the CUDA driver with AVX2 and did a test against stock 0.3.4. Model qwen2:0.5b, prompt "why is the sky blue?", RTX4070.
>
> - baseline CPU performance in both versions: 93 tokens per second (cpu_avx2 runner)
> - baseline GPU performance in both versions: 287 tps (cuda runner)
> - 1 of 25 layers in GPU: 0.3.4 = 83 tps, 0.3.4+avx2 = 91 tps, 9.6% improvement
> - 12 of 25 layers in GPU: 0.3.4 = 100 tps, 0.3.4+avx2 = 108 tps, 8% improvement
> - 24 of 25 layers in GPU: 0.3.4 = 142 tps, 0.3.4+avx2 = 146 tps, 2.8% improvement

That seems like a pretty decent difference. What file(s) would I edit to compile a CUDA version with AVX2? I'd also like to give AVX512 a try here.


@rick-github commented on GitHub (Aug 12, 2024):

On a linux system using docker:

```diff
--- a/Dockerfile
+++ b/Dockerfile
@@ -18,7 +18,7 @@ ENV PATH /opt/rh/devtoolset-10/root/usr/bin:$PATH
 COPY --from=llm-code / /go/src/github.com/ollama/ollama/
 WORKDIR /go/src/github.com/ollama/ollama/llm/generate
 ARG CGO_CFLAGS
-RUN OLLAMA_SKIP_STATIC_GENERATE=1 OLLAMA_SKIP_CPU_GENERATE=1 sh gen_linux.sh
+RUN OLLAMA_SKIP_STATIC_GENERATE=1 OLLAMA_SKIP_CPU_GENERATE=1 OLLAMA_CUSTOM_CUDA_DEFS="-DGGML_AVX2=on -DGGML_FMA=on -DGGML_F16C=on" sh gen_linux.sh
 
 FROM --platform=linux/arm64 nvidia/cuda:$CUDA_VERSION-devel-rockylinux8 AS cuda-build-arm64
 ARG CMAKE_VERSION
```

I think for windows you would edit `llm/generate/gen_windows.ps1` and at line 275 change `-DGGML_AVX2=off` to `-DGGML_AVX512=on`. For avx2, there are additional arguments (`-DGGML_FMA=on -DGGML_F16C=on`); I don't know if you need to include them for avx512.


@AncientMystic commented on GitHub (Aug 12, 2024):

> On a linux system using docker:
>
> ```diff
> --- a/Dockerfile
> +++ b/Dockerfile
> @@ -18,7 +18,7 @@ ENV PATH /opt/rh/devtoolset-10/root/usr/bin:$PATH
>  COPY --from=llm-code / /go/src/github.com/ollama/ollama/
>  WORKDIR /go/src/github.com/ollama/ollama/llm/generate
>  ARG CGO_CFLAGS
> -RUN OLLAMA_SKIP_STATIC_GENERATE=1 OLLAMA_SKIP_CPU_GENERATE=1 sh gen_linux.sh
> +RUN OLLAMA_SKIP_STATIC_GENERATE=1 OLLAMA_SKIP_CPU_GENERATE=1 OLLAMA_CUSTOM_CUDA_DEFS="-DGGML_AVX2=on -DGGML_FMA=on -DGGML_F16C=on" sh gen_linux.sh
>
>  FROM --platform=linux/arm64 nvidia/cuda:$CUDA_VERSION-devel-rockylinux8 AS cuda-build-arm64
>  ARG CMAKE_VERSION
> ```
>
> I think for windows you would edit `llm/generate/gen_windows.ps1` and at line 275 change `-DGGML_AVX2=off` to `-DGGML_AVX512=on`. For avx2, there are additional arguments (`-DGGML_FMA=on -DGGML_F16C=on`); I don't know if you need to include them for avx512.

Awesome, thank you very much. Compiling now; I am giving AVX2 a try, then once I see how that goes I will see if I can make AVX512 work and test the differences.


@AncientMystic commented on GitHub (Aug 12, 2024):

Tested CUDA+AVX2 a bit; there seems to be a slight token increase across the board, especially on larger models going over VRAM.

But it also seems to have another significant impact: I am noticing less of a pause between chunks being generated on larger models.

Before, it would pause for a second between words; now it seems to generate larger chunks, and the pause between them is so short it's almost nonexistent. It was previously up to 1-3 seconds, so this results in a much faster and smoother response from any model. Even the very large models that run at about 1 token/s feel smoother and more usable now.

Still trying to get CPU avx512 & CUDA+avx512 compiled; it does not seem to want to do it on Windows.


@AncientMystic commented on GitHub (Aug 12, 2024):

I have successfully gotten the cpu_avx512 runner and CUDA+AVX512 built and running.

I added at line 259 (now occupying lines 260-274):

```powershell
function build_cpu_avx512() {
    if ((-not "${env:OLLAMA_SKIP_CPU_GENERATE}" ) -and ((-not "${env:OLLAMA_CPU_TARGET}") -or ("${env:OLLAMA_CPU_TARGET}" -eq "cpu_avx512"))) {
        init_vars
        $script:cmakeDefs = $script:commonCpuDefs + @("-A", "x64", "-DGGML_AVX=on", "-DGGML_AVX2=on", "-DGGML_AVX512=on", "-DGGML_FMA=on", "-DGGML_F16C=on") + $script:cmakeDefs
        $script:buildDir="../build/windows/${script:ARCH}/cpu_avx512"
        $script:distDir="$script:DIST_BASE\cpu_avx512"
        write-host "Building AVX512 CPU"
        build
        sign
        install
    } else {
        write-host "Skipping CPU AVX512 generation step as requested"
    }
}
```

and a call to `build_cpu_avx512`, now occupying line 432, directly under `build_cpu_avx2`,

then ran the commands:

```powershell
$env:OLLAMA_CPU_TARGET="cpu_avx512"
go generate ./...
go build .
```

Results seem to be up to a 10-20% token increase on some models (roughly equal to CUDA+AVX2 on others), plus even smoother generation. The pause between chunks now seems extremely small, to the point that if it were just a little lower it would be hard to notice any pausing at all. This is a significant improvement over the standard CUDA+AVX, where the pause at low token rates, with part of the model offloaded to the CPU, is so long that you wait twice as long for the response to finish.

EDIT: I also just noticed CPU usage seems to have dropped dramatically. It is currently hitting 40%, where it was often 90-98% when ollama was generating before with CUDA+AVX, so this will save power too.


@dhiltgen commented on GitHub (Aug 18, 2024):

We've been holding off on avx512 based on our initial performance tests showing a very minimal performance improvement, and the desire to avoid too much sprawl in the permutations of runners we bundle. With #5049 we'll be adding CUDA v12, which will let us enable features that don't exist in v11 for newer GPUs, but there's also potential to refine the CPU flags we use for the v12 runner, since we can still fall back to v11 with AVX to support older (or server) CPUs that lack the vector extensions.

Our intent is to make it straightforward for users to build from source and tune the CPU flags for the GPU runners, but it's still a bit rough around the edges. If you bump into bugs in setting `OLLAMA_CUSTOM_CUDA_DEFS` that prevent this from working, PRs are welcome. Just CC me so I can review them.


@AncientMystic commented on GitHub (Aug 18, 2024):

> We've been holding off on avx512 based on our initial performance tests showing a very minimal performance improvement, and the desire to avoid too much sprawl in the permutations of runners we bundle. With #5049 we'll be adding CUDA v12, which will let us enable features that don't exist in v11 for newer GPUs, but there's also potential to refine the CPU flags we use for the v12 runner, since we can still fall back to v11 with AVX to support older (or server) CPUs that lack the vector extensions.
>
> Our intent is to make it straightforward for users to build from source and tune the CPU flags for the GPU runners, but it's still a bit rough around the edges. If you bump into bugs in setting `OLLAMA_CUSTOM_CUDA_DEFS` that prevent this from working, PRs are welcome. Just CC me so I can review them.

I can definitely understand not adding 512 by default; there isn't a lot of support for it on CPUs. So far the biggest impact I'm noticing is that with CUDA+AVX512 it handles larger models more smoothly and at lower CPU usage.

For example, I was getting 0.8-1.2 t/s with 90-99% CPU usage. I have been testing 8x7b models now (46.7b, 25-30GB in size), and I am seeing 3-3.6 t/s with no pausing and only 40% CPU usage. (With CUDA+AVX, once anything was on the CPU, it seemed to pause longer the larger the model, and the pausing between words made it horrible to use.)

I think CUDA+AVX2 accounts for the majority of the performance boost; AVX512 only seems to add a tiny boost on a few models and overall is about the same, as you said, besides being a little smoother. The biggest improvement I noticed with 512 was the massive drop in CPU usage, which is a big deal.

I have also been using the KV cache PR #6279; it has also helped significantly with running larger models.

I have also been trying to add oneAPI to the mix (for my 4GB Arc GPU) but cannot seem to get it to compile; I keep running into errors relating to kernel32.lib and other libraries.

Are there any other PRs, optional features, or anything that can be enabled/added to increase performance?

I was looking at PR #3468, and the NeuralSpeed backend for Skylake or newer seems to have a 7.27x performance increase, which would be really amazing, but the PR seems to be abandoned. (And it doesn't include Windows.)

(Running an i7-7820X with 96GB RAM + Tesla P4 8GB and Arc A310 4GB)


@dhiltgen commented on GitHub (Apr 9, 2025):

The new cmake-based build now builds multiple CPU-optimized libraries distinct from the GPU libraries, including avx512 support.


Reference: github-starred/ollama#65996