[GH-ISSUE #8324] Add a CUDA+AVX2(VNNI) runner to the Docker image. #51843

Open
opened 2026-04-28 21:03:49 -05:00 by GiteaMirror · 0 comments

Originally created by @x0wllaar on GitHub (Jan 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8324

Description:
I would like to request adding a CUDA+AVX2 (and possibly VNNI) model runner to the default Docker image for Ollama. I think this can help with performance in partial-offload scenarios.
This should already be supported at build time (#2281), but for some reason I can't find the runner in the Docker image.

I think I should be able to do this locally by adding `--build-arg CUSTOM_CPU_FLAGS=avx2,avxvnni` to the Docker build, but I still think it would be beneficial to add a runner target named something like `cuda_v12_moderncpu` that enables AVX2 and AVX-VNNI by default and is built by default (a sketch of the local workaround follows below).
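For reference, a minimal sketch of that local workaround, assuming the repository's Dockerfile actually plumbs the `CUSTOM_CPU_FLAGS` build arg through to the runner build (per #2281); the image tag is illustrative:

```sh
# Rebuild the image locally with AVX2 + AVX-VNNI runners enabled.
# Assumes the Dockerfile honors the CUSTOM_CPU_FLAGS build arg (see #2281);
# the tag name is just an example.
docker build \
  --build-arg CUSTOM_CPU_FLAGS=avx2,avxvnni \
  -t ollama:cuda-avx2vnni .
```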

I might work on this and submit a PR.

Benefits:

  • Squeeze out some additional performance in partial-offload scenarios

Example Use Case:
Running a 32B model on a GPU with 16 GB of VRAM (in my case, a 13900HX + Laptop 4090), where the layers that don't fit in VRAM have to run on the CPU.
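Before rebuilding, it is worth confirming the host CPU actually advertises these instructions. A quick check on Linux (`avx_vnni` is the flag the kernel reports for AVX-VNNI on Alder/Raptor Lake parts such as the 13900HX):

```sh
# Print "avx2" and/or "avx_vnni" if the host CPU supports them.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(avx2|avx_vnni)$'
```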

Environment Variables/Configuration:
No additional configuration needed.

Related Files and Code:

  • make/cuda.make (I guess; a hypothetical sketch follows this list)
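Purely as a hypothetical sketch of the proposed target (the names below are assumptions based on this issue, not the actual contents of make/cuda.make; it presumes an existing cuda_v12 target and a build that honors CUSTOM_CPU_FLAGS):

```make
# Hypothetical sketch only; target and variable names are assumptions.
# Adds a CUDA v12 runner variant with AVX2 + AVX-VNNI enabled by default,
# via a GNU make target-specific variable on top of an assumed cuda_v12 target.
cuda_v12_moderncpu: CUSTOM_CPU_FLAGS := avx2,avxvnni
cuda_v12_moderncpu: cuda_v12
```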
GiteaMirror added the feature request label 2026-04-28 21:03:49 -05:00
Reference: github-starred/ollama#51843