[GH-ISSUE #9018] CUDA error: an illegal memory access was encountered #31625

Open
opened 2026-04-22 12:15:57 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @aginies on GitHub (Feb 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9018

What is the issue?

OS: openSUSE Leap 15.6
GPU: NVIDIA RTX 5090
CUDA: 12.8
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.10 Driver Version: 570.86.10 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+

It sounds like there is a CUDA error with the pre-built version of ollama 0.5.7, probably because it was built against CUDA 12.4, so the latest NVIDIA 50xx cards with CUDA 12.8 cannot work properly.
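
The RTX 5090 is a Blackwell part and reports compute capability 12.0 (the server log below confirms compute=12.0), and CUDA 12.8 is the first toolkit release with Blackwell support. As a quick diagnostic sketch (not part of the original report; `compute_cap` is a standard nvidia-smi query field on recent drivers), one way to confirm what the card reports:

```
# Print the GPU's compute capability; Blackwell RTX 50xx cards report 12.0.
# A prebuilt runner that ships only pre-Blackwell SASS, with no PTX the driver
# can JIT for sm_120, may fail at load or inference time on these cards.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```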

From client side:

./ollama --version
ollama version is 0.5.7

./ollama run codeqwen:7b 
Error: llama runner process has terminated: CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_buffer_clear at llama/ggml-cuda/ggml-cuda.cu:539
  cudaDeviceSynchronize()
llama/ggml-cuda/ggml-cuda.cu:96: CUDA error
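
Worth noting when reading this trace: CUDA kernel launches are asynchronous, so an illegal-memory-access error surfaced by cudaDeviceSynchronize() inside ggml_backend_cuda_buffer_clear usually originates in work queued earlier rather than in the buffer clear itself. A hedged debugging sketch (CUDA_LAUNCH_BLOCKING is a standard CUDA environment variable, OLLAMA_DEBUG appears in the server config below, and synchronous launches make inference much slower):

```
# Force synchronous kernel launches so the error is reported at or near the
# call that actually faulted, then re-run the failing model load.
CUDA_LAUNCH_BLOCKING=1 OLLAMA_DEBUG=1 ./ollama serve
```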

Server side:
./ollama serve
2025/02/11 18:15:42 routes.go:1187: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/aginies/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-02-11T18:15:42.973+01:00 level=INFO source=images.go:432 msg="total blobs: 6"
time=2025-02-11T18:15:42.973+01:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

  • using env: export GIN_MODE=release
  • using code: gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-02-11T18:15:42.974+01:00 level=INFO source=routes.go:1238 msg="Listening on 127.0.0.1:11434 (version 0.5.7)"
time=2025-02-11T18:15:42.974+01:00 level=INFO source=routes.go:1267 msg="Dynamic LLM libraries" runners="[cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx cpu cpu_avx]"
time=2025-02-11T18:15:42.974+01:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-02-11T18:15:43.252+01:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-446f513c-0699-5337-2cd0-9fa3d507cc94 library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="30.8 GiB"
..........
time=2025-02-11T18:17:26.879+01:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: CUDA error: an illegal memory access was encountered\n current device: 0, in function ggml_backend_cuda_buffer_clear at llama/ggml-cuda/ggml-cuda.cu:539\n cudaDeviceSynchronize()\nllama/ggml-cuda/ggml-cuda.cu:96: CUDA error"

The full debug file:
debug.gz: https://github.com/user-attachments/files/18755064/debug.gz

Relevant log output


OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.5.7

GiteaMirror added the bug label 2026-04-22 12:15:57 -05:00
Author
Owner

@jmorganca commented on GitHub (Feb 11, 2025):

Hi @aginies, thanks for the issue! I've tested 0.5.7 with a 5090 and it seems to work; that said, I'll look into this. In the meantime, would it be possible to test the 0.5.8 prerelease (https://github.com/ollama/ollama/releases/tag/v0.5.8)?

Author
Owner

@aginies commented on GitHub (Feb 12, 2025):

I just tried with the prerelease; same error:

./ollama run codeqwen:7b
Error: llama runner process has terminated: CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_buffer_clear at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:510
  cudaDeviceSynchronize()
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:70: CUDA error

./ollama --version
ollama version is 0.5.8

Author
Owner

@aginies commented on GitHub (Feb 12, 2025):

Tweaking the Dockerfile, I was able to rebuild ollama on CentOS Stream 9 with a more recent gcc 13 and CUDA 12.8 (the old CentOS 7 base cannot get a recent CUDA version). I still have the same issue, so it doesn't sound like this is a CUDA version error.

Here is the Dockerfile, in case it can help anyone building and testing with gcc 13 and CUDA 12.8 (see also the note on CUDA architectures after it):

ARG CMAKEVERSION=3.31.2

FROM --platform=linux/amd64 dokken/centos-stream-9 AS base-amd64
RUN yum install -y yum-utils gcc-toolset-13-gcc gcc-toolset-13-gcc-c++ xz libstdc++-devel.x86_64 gcc-toolset-13 \
    && yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo \
    && curl -s -L https://github.com/ccache/ccache/releases/download/v4.10.2/ccache-4.10.2-linux-x86_64.tar.xz | tar -Jx -C /usr/local/bin --strip-components 1

FROM base-${TARGETARCH} AS base
ARG CMAKEVERSION
RUN curl -fsSL https://github.com/Kitware/CMake/releases/download/v${CMAKEVERSION}/cmake-${CMAKEVERSION}-linux-$(uname -m).tar.gz | tar xz -C /usr/local --strip-components 1
COPY CMakeLists.txt CMakePresets.json .
COPY ml/backend/ggml/ggml ml/backend/ggml/ggml
ENV LDFLAGS=-s

FROM base AS cpu
RUN if [ "$(uname -m)" = "x86_64" ]; then yum install -y gcc-toolset-13-gcc gcc-toolset-13-gcc-c++ xz libstdc++-devel.x86_64 gcc-toolset-13; fi
ENV PATH=/opt/rh/gcc-toolset-13/root/usr/bin:$PATH
RUN --mount=type=cache,target=/root/.ccache \
    cmake --preset 'CPU' \
        && cmake --build --parallel --preset 'CPU' \
        && cmake --install build --component CPU --strip --parallel 4

FROM base AS cuda-12
ARG CUDA12VERSION=12.8
RUN yum install -y cuda-toolkit-${CUDA12VERSION//./-}
ENV PATH=/usr/local/cuda-12/bin:$PATH
RUN --mount=type=cache,target=/root/.ccache \
    cmake --preset 'CUDA 12' \
        && cmake --build --parallel --preset 'CUDA 12' \
        && cmake --install build --component CUDA --strip --parallel 4

FROM base AS build
ARG GOVERSION=1.23.4
RUN curl -fsSL https://golang.org/dl/go${GOVERSION}.linux-$(case $(uname -m) in x86_64) echo amd64 ;; aarch64) echo arm64 ;; esac).tar.gz | tar xz -C /usr/local
ENV PATH=/usr/local/go/bin:$PATH:/opt/rh/gcc-toolset-13/root/usr/bin/
WORKDIR /go/src/github.com/ollama/ollama
COPY . .
ARG GOFLAGS="'-ldflags=-w -s'"
ENV CGO_ENABLED=1
RUN --mount=type=cache,target=/root/.cache/go-build \
    go build -trimpath -buildmode=pie -o /bin/ollama .

FROM --platform=linux/amd64 scratch AS amd64
COPY --from=cuda-12 dist/lib/ollama/cuda_v12 /lib/ollama/cuda_v12

FROM ${TARGETARCH} AS archive
COPY --from=cpu dist/lib/ollama /lib/ollama
COPY --from=build /bin/ollama /bin/ollama

FROM dokken/centos-stream-9
COPY --from=archive /bin /usr/bin
ENV PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
COPY --from=archive /lib/ollama /usr/lib/ollama
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV NVIDIA_VISIBLE_DEVICES=all
ENV OLLAMA_HOST=0.0.0.0:11434
EXPOSE 11434
ENTRYPOINT ["/bin/ollama"]
CMD ["serve"]
Author
Owner

@aginies commented on GitHub (Feb 13, 2025):

FYI, I have raised and discussed this issue in the llama.cpp project:
https://github.com/ggerganov/llama.cpp/issues/11829#issuecomment-2654506589

Author
Owner

@Moumeneb1 commented on GitHub (Apr 10, 2025):

Hi, did you manage to fix it?

Author
Owner

@jksjaz commented on GitHub (May 8, 2025):

Getting the same error with llama4:scout

Author
Owner

@tjwebb commented on GitHub (Oct 19, 2025):

Same error with qwen3-coder on an RTX 6000 Pro Blackwell.

Author
Owner

@ovflowd commented on GitHub (Nov 23, 2025):

+1, also facing this on an RTX 2000 Ada and an RTX 5070 Ti.

Sun Nov 23 18:43:58 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   47C    P1             41W /  300W |     319MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 2000 Ada Gene...    Off |   00000000:03:00.0 Off |                  Off |
| 30%   48C    P0             17W /   70W |       5MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2629      G   ...rack-uuid=3190708988185955192        291MiB |
|    0   N/A  N/A           38589      G   resources                                 3MiB |
+-----------------------------------------------------------------------------------------+
❯ ollama -v
ollama version is 0.13.0
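
Since this setup mixes a Blackwell card (RTX 5070 Ti, compute capability 12.0) with an Ada card (RTX 2000 Ada, 8.9), it may help to isolate each GPU and see whether the fault follows the Blackwell card. A hedged sketch using CUDA_VISIBLE_DEVICES, which ollama honors per the server config dump earlier in this issue; the device indices are assumptions and may not match nvidia-smi's ordering:

```
# Expose only the RTX 2000 Ada (assumed device index 1), restart, and retry:
CUDA_VISIBLE_DEVICES=1 ollama serve

# Then expose only the RTX 5070 Ti (assumed device index 0):
CUDA_VISIBLE_DEVICES=0 ollama serve
```
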
Reference: github-starred/ollama#31625