[GH-ISSUE #2281] Support GPU runners with AVX2 #63351

Closed
opened 2026-05-03 13:06:29 -05:00 by GiteaMirror · 7 comments

Originally created by @hyjwei on GitHub (Jan 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2281

Originally assigned to: @dhiltgen on GitHub.

I am running ollama on an i7-14700K, which supports AVX2 and AVX_VNNI, and a GeForce GTX 1060.

After reading #2205, I enabled OLLAMA_DEBUG=1 to check whether ollama utilizes this CPU's AVX2. But unlike that issue, I couldn't get ollama to use AVX2. The debug output shows:

time=2024-01-30T12:27:26.016-05:00 level=INFO source=/tmp/ollama/gpu/gpu.go:146 msg="CUDA Compute Capability detected: 6.1"
time=2024-01-30T12:27:26.016-05:00 level=INFO source=/tmp/ollama/gpu/cpu_common.go:11 msg="CPU has AVX2"
loading library /tmp/ollama1660685050/cuda_v12/libext_server.so
time=2024-01-30T12:27:26.032-05:00 level=INFO source=/tmp/ollama/llm/dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama1660685050/cuda_v12/libext_server.so"
time=2024-01-30T12:27:26.032-05:00 level=INFO source=/tmp/ollama/llm/dyn_ext_server.go:145 msg="Initializing llama server"
[1706635646] system info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
[1706635646] Performing pre-initialization of GPU
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060 3GB, compute capability 6.1, VMM: yes

So ollama does detect the GPU and also reports "CPU has AVX2". However, when initializing the server, it shows AVX2 = 0 as well as AVX_VNNI = 0.

I also followed the development docs (https://github.com/ollama/ollama/blob/main/docs/development.md), setting OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_F16C=on -DLLAMA_FMA=on", to build the binary locally with AVX2 support. However, the result is the same as with the released binary, and I still get AVX_VNNI = 0 | AVX2 = 0. How can I make ollama use my CPU's AVX2?
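For reference, the build steps I used were roughly the following (a sketch based on docs/development.md at the time; exact commands may differ between versions):

```
# Local build with custom CPU flags (roughly the steps from docs/development.md
# at the time; exact commands may vary between versions):
export OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_F16C=on -DLLAMA_FMA=on"
go generate ./...               # compiles the llama.cpp runner libraries
go build .                      # builds the ollama binary
OLLAMA_DEBUG=1 ./ollama serve   # then check the "system info: ..." line in the log
```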

GiteaMirror added the gpu, build, feature request labels 2026-05-03 13:06:31 -05:00

@hyjwei commented on GitHub (Jan 30, 2024):

Here is my local build log:

+ echo 'CUDA libraries detected - building dynamic CUDA library'
CUDA libraries detected - building dynamic CUDA library
+ init_vars
+ case "${GOARCH}" in
+ ARCH=x86_64
+ LLAMACPP_DIR=../llama.cpp
+ CMAKE_DEFS=
+ CMAKE_TARGETS='--target ext_server'
+ echo ''
+ grep -- -g
+ CMAKE_DEFS='-DCMAKE_BUILD_TYPE=Release -DLLAMA_SERVER_VERBOSE=off '
+ case $(uname -s) in
++ uname -s
+ LIB_EXT=so
+ WHOLE_ARCHIVE=-Wl,--whole-archive
+ NO_WHOLE_ARCHIVE=-Wl,--no-whole-archive
+ GCC_ARCH=
+ '[' -z '50;52;61;70;75;80' ']'
++ head -1
++ cut -f3 -d.
++ ls /usr/local/cuda/lib64/libcudart.so.12 /usr/local/cuda/lib64/libcudart.so.12.3.101
+ CUDA_MAJOR=12
+ '[' -n 12 ']'
+ CUDA_VARIANT=_v12
+ CMAKE_DEFS='-DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on -DCMAKE_CUDA_ARCHITECTURES=50;52;61;70;75;80 -DCMAKE_POSITION_INDEPENDENT_CODE=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DCMAKE_BUILD_TYPE=Release -DLLAMA_SERVER_VERBOSE=off '
+ BUILD_DIR=../llama.cpp/build/linux/x86_64/cuda_v12
+ EXTRA_LIBS='-L/usr/local/cuda/lib64 -lcudart -lcublas -lcublasLt -lcuda'
+ build
+ cmake -S ../llama.cpp -B ../llama.cpp/build/linux/x86_64/cuda_v12 -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on '-DCMAKE_CUDA_ARCHITECTURES=50;52;61;70;75;80' -DCMAKE_POSITION_INDEPENDENT_CODE=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DCMAKE_BUILD_TYPE=Release -DLLAMA_SERVER_VERBOSE=off

Here, when building the CUDA target, CMAKE_DEFS is set to '-DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on -DCMAKE_CUDA_ARCHITECTURES=50;52;61;70;75;80 -DCMAKE_POSITION_INDEPENDENT_CODE=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DCMAKE_BUILD_TYPE=Release -DLLAMA_SERVER_VERBOSE=off'.

I checked the script llm/generate/gen_linux.sh; it looks like OLLAMA_CUSTOM_CPU_DEFS is only used when building the CPU targets. When building the CUDA target, it uses COMMON_CMAKE_DEFS, which sets -DLLAMA_AVX2=off.

I changed it to COMMON_CMAKE_DEFS="-DCMAKE_POSITION_INDEPENDENT_CODE=on -DLLAMA_NATIVE=on -DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_AVX512=off -DLLAMA_FMA=on -DLLAMA_F16C=on" and rebuilt the ollama binary. It now works with AVX2 enabled.

So I suggest adding similar handling of OLLAMA_CUSTOM_CPU_DEFS to the blocks that build the dynamic CUDA library.
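For illustration, the change could look roughly like this in the CUDA section of llm/generate/gen_linux.sh (a hypothetical sketch mirroring the existing CPU-target handling, not the actual script):

```
# Hypothetical addition to the CUDA section of llm/generate/gen_linux.sh:
# honor OLLAMA_CUSTOM_CPU_DEFS for the host-code vector flags, the same way
# the CPU targets already do, instead of always using COMMON_CMAKE_DEFS.
if [ -n "${OLLAMA_CUSTOM_CPU_DEFS}" ]; then
    echo "Building CUDA runner with custom CPU flags: ${OLLAMA_CUSTOM_CPU_DEFS}"
    CMAKE_DEFS="${OLLAMA_CUSTOM_CPU_DEFS} ${CMAKE_DEFS}"
else
    CMAKE_DEFS="${COMMON_CMAKE_DEFS} ${CMAKE_DEFS}"
fi
```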


@dhiltgen commented on GitHub (Mar 12, 2024):

At present, we only compile the GPU runners with AVX. Some users want no vector extensions, which is tracked via #2187.

We're trying to avoid sprawl of too many permutations, so we'll need to verify this has a large enough performance impact when running models split between CPU/GPU to justify adding it.
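As a rough way to gauge that impact on a given machine (a sketch, not an official benchmark; the model name and prompt below are only examples):

```
# Rough comparison sketch (model name and prompt are just examples; assumes the
# model is large enough to spill out of VRAM so the CPU handles part of the work):
ollama run llama2:13b --verbose "Summarize the plot of Hamlet."
# Compare the reported "prompt eval rate" / "eval rate" between the stock
# (AVX-only GPU runner) binary and a locally built AVX2-enabled binary.
```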


@MarkWard0110 commented on GitHub (Apr 19, 2024):

@dhiltgen

> At present, we only compile the GPU runners with AVX. Some users want no vector extensions, which is tracked via #2187.
>
> We're trying to avoid sprawl of too many permutations, so we'll need to verify this has a large enough performance impact when running models split between CPU/GPU to justify adding it.

Does this mean Ollama is not utilizing all of the CPU's features, and if it were, we would see improved CPU times?


@MarkWard0110 commented on GitHub (Apr 19, 2024):

@dhiltgen, what could I do to help benchmark an Ollama build that fully utilizes the features of an Intel Core i9-14900K? It has an integrated GPU, which I'd also be interested in testing to see whether it even gets utilized alongside the CPU. The benchmark would be offset with an Nvidia RTX 4070 Ti Super 16GB.


@dhiltgen commented on GitHub (Apr 23, 2024):

@MarkWard0110 to clarify how this works, we compile multiple variations of the llama.cpp component with different compile flags. At present, for x86, we compile 3 variations for CPU-only usage (no GPU support): one with no vector extensions (simply called "cpu") and 2 with AVX* support ("cpu_avx" and "cpu_avx2") to take advantage of the vector math extensions available on many CPUs. We compile 1 each for CUDA and ROCm to support NVIDIA and AMD GPUs.

These two GPU runners are primarily intended for GPU usage; however, if you attempt to load a model that is larger than the available VRAM, it spills over into system memory and has to use the CPU to solve those portions of the LLM. It's this spill-over that comes into play with CPU vector extensions. If 100% of the model fits in GPU VRAM, then the CPU vector features are ~irrelevant. If you spill over, then the speed of the portions running on the CPU is impacted by the vector extensions.

Our testing shows AVX2 is ~10% faster than AVX, and AVX is ~400% faster than no vector extensions. This is why we've focused on AVX as the sweet spot for our GPU runners. We pick which runner to execute at runtime based on discovering the capabilities of the CPU and which GPU(s) we find.

Each variant we add increases build times, install size, and general complexity of the system, so we're trying to balance permutation sprawl vs. optimized performance.
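For context, a quick way to see what applies on a given machine (a sketch; exact log wording varies by version) is to check which vector extensions the CPU advertises and then watch the debug log for which runner library gets loaded:

```
# Which vector extensions does this CPU advertise? (Linux)
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^avx'

# Start the server with debug logging and look for the runner that gets loaded,
# e.g. "Loading Dynamic llm server: .../cuda_v12/libext_server.so" and the
# llama.cpp "system info: AVX = 1 | AVX2 = 0 | ..." line shown earlier.
OLLAMA_DEBUG=1 ollama serve
```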


@dhiltgen commented on GitHub (Jun 1, 2024):

PR #4517 lays the foundation so we can document how to ~easily build from source and get a local build with different vector extensions for the GPU runners. Once that's merged, this issue can be resolved with developer docs.

@ayttop commented on GitHub (Aug 28, 2024):

https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md
Reference: github-starred/ollama#63351