[GH-ISSUE #1309] [WSL2] Cuda error 222 : the provided PTX was compiled with an unsupported toolchain. #62714

Closed
opened 2026-05-03 10:03:46 -05:00 by GiteaMirror · 6 comments

Originally created by @fxrobin on GitHub (Nov 29, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1309

Originally assigned to: @dhiltgen on GitHub.

On Windows WSL2, with the CUDA Toolkit and the NVIDIA Container Toolkit installed, I'm facing this issue when running the official Docker image:

ollama-ollama-1    | 2023/11/29 00:36:04 llama.go:292: 3676 MB VRAM available, loading up to 21 GPU layers
ollama-ollama-1    | 2023/11/29 00:36:04 llama.go:421: starting llama runner
ollama-ollama-1    | 2023/11/29 00:36:04 llama.go:479: waiting for llama runner to start responding
ollama-ollama-1    | ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ollama-ollama-1    | ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ollama-ollama-1    | ggml_init_cublas: found 1 CUDA devices:
ollama-ollama-1    |   Device 0: NVIDIA RTX A1000 Laptop GPU, compute capability 8.6
ollama-ollama-1    |
ollama-ollama-1    | CUDA error 222 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:5965: the provided PTX was compiled with an unsupported toolchain.
ollama-ollama-1    | current device: 0
ollama-ollama-1    | 2023/11/29 00:36:04 llama.go:436: 222 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:5965: the provided PTX was compiled with an unsupported toolchain.
ollama-ollama-1    | current device: 0
ollama-ollama-1    | 2023/11/29 00:36:04 llama.go:444: error starting llama runner: llama runner process has terminated
ollama-ollama-1    | 2023/11/29 00:36:04 llama.go:510: llama runner stopped successfully
ollama-ollama-1    | 2023/11/29 00:36:04 llama.go:421: starting llama runner
ollama-ollama-1    | 2023/11/29 00:36:04 llama.go:479: waiting for llama runner to start responding
ollama-ollama-1    | {"timestamp":1701218164,"level":"WARNING","function":"server_params_parse","line":2035,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}
ollama-ollama-1    | {"timestamp":1701218164,"level":"INFO","function":"main","line":2534,"message":"build info","build":375,"commit":"9656026"}
ollama-ollama-1    | {"timestamp":1701218164,"level":"INFO","function":"main","line":2537,"message":"system info","n_threads":12,"n_threads_batch":-1,"total_threads":24,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
ollama-ollama-1    | llama_model_loader: loaded meta data with 18 key-value pairs and 196 tensors from /root/.ollama/models/blobs/sha256:305c4103a989d3f8ac457f912af30f32693f20dcffe1495e18c2ed7b5596b2d1 (version GGUF V2)

So Ollama falls back to the CPU runner and does not use my GPU.

When I check whether Docker can use my GPU, it seems OK:

$ docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Tue Nov 28 23:56:24 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.91       Driver Version: 517.89       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A100...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   38C    P8     3W /  N/A |    323MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        22      G   /Xwayland                       N/A      |
+-----------------------------------------------------------------------------+
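
A successful nvidia-smi from a plain Ubuntu container only proves that the driver is passed through to Docker in general. To check what the Ollama container itself sees, one can run nvidia-smi through the ollama/ollama image instead (a sketch, assuming the NVIDIA runtime injects nvidia-smi into the container, which it does when the utility capability is enabled):

$ docker run --rm --gpus all --entrypoint nvidia-smi ollama/ollama

If this fails while the Ubuntu test above succeeds, the problem is specific to how the Ollama container is set up, not to the driver passthrough.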

On Ollama startup, there is no warning about the GPU being inaccessible:

ollama-ollama-1    | 2023/11/29 00:07:32 images.go:784: total blobs: 15
ollama-ollama-1    | 2023/11/29 00:07:32 images.go:791: total unused blobs removed: 0
ollama-ollama-1    | 2023/11/29 00:07:32 routes.go:777: Listening on [::]:11434 (version 0.1.12)

Here is my distribution:

$ uname -a
Linux FRLFK0635009890 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

Models:

root@de433da63a97:/# ollama list
NAME                    ID              SIZE    MODIFIED
codellama:latest        8fdf8f752f6e    3.8 GB  51 minutes ago
codeup:latest           54289661f7a9    7.4 GB  39 minutes ago
falcon:latest           4280f7257e73    4.2 GB  34 minutes ago

When I look at the source code of ggml-cuda.cu:

for (int id = 0; id < g_device_count; ++id) {
    CUDA_CHECK(ggml_cuda_set_device(id));

    // create cuda streams
    for (int is = 0; is < MAX_STREAMS; ++is) {
        CUDA_CHECK(cudaStreamCreateWithFlags(&g_cudaStreams[id][is], cudaStreamNonBlocking));
    }

    // create cublas handle
    CUBLAS_CHECK(cublasCreate(&g_cublas_handles[id]));
    CUBLAS_CHECK(cublasSetMathMode(g_cublas_handles[id], CUBLAS_TF32_TENSOR_OP_MATH));
}

The error is raised by the CUDA_CHECK around cudaStreamCreateWithFlags(&g_cudaStreams[id][is], cudaStreamNonBlocking) inside the inner loop.
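
CUDA error 222 is cudaErrorUnsupportedPtxVersion: the PTX embedded in the binary was produced by a CUDA toolkit newer than what the installed driver's JIT compiler supports. The stream creation is simply the first runtime call that forces context creation and module loading. Here is a minimal standalone sketch that exercises the same path (not Ollama's code; the trivial kernel and the compute_86 build target, matching the A1000's compute capability 8.6, are illustrative assumptions):

// repro.cu -- carries a trivial kernel so the binary embeds PTX.
// Build with PTX only, forcing the driver to JIT it at load time:
//   nvcc -gencode arch=compute_86,code=compute_86 -o repro repro.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    cudaStream_t stream;
    // First CUDA runtime call: triggers primary context creation and
    // module loading -- the same point where ggml-cuda.cu fails.
    cudaError_t err = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    if (err != cudaSuccess) {
        // On the broken setup this should print error 222:
        // "the provided PTX was compiled with an unsupported toolchain"
        fprintf(stderr, "CUDA error %d: %s\n", (int)err, cudaGetErrorString(err));
        return 1;
    }
    noop<<<1, 1, 0, stream>>>();
    err = cudaDeviceSynchronize();
    printf("kernel result: %s\n", cudaGetErrorString(err));
    cudaStreamDestroy(stream);
    return 0;
}

Building this inside the container with the same toolkit that built the Ollama binary should reproduce the failure, while building it with a toolkit matching the driver (CUDA 11.7 here) should succeed.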

GiteaMirror added the nvidia label 2026-05-03 10:03:46 -05:00

@fxrobin commented on GitHub (Nov 29, 2023):

Just to be sure, I installed another Ollama running natively without Docker on the same computer, and everything is fine. My GPU is used and there are no errors in the log file.

Nov 29 09:22:01 FRLFK0635009890 ollama[5932]: 2023/11/29 09:22:01 llama.go:292: 3758 MB VRAM available, loading up to 24 GPU layers
Nov 29 09:22:01 FRLFK0635009890 ollama[5932]: 2023/11/29 09:22:01 llama.go:421: starting llama runner
Nov 29 09:22:01 FRLFK0635009890 ollama[5932]: 2023/11/29 09:22:01 llama.go:479: waiting for llama runner to start responding
Nov 29 09:22:01 FRLFK0635009890 ollama[5932]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
Nov 29 09:22:01 FRLFK0635009890 ollama[5932]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
Nov 29 09:22:01 FRLFK0635009890 ollama[5932]: ggml_init_cublas: found 1 CUDA devices:
Nov 29 09:22:01 FRLFK0635009890 ollama[5932]:   Device 0: NVIDIA RTX A1000 Laptop GPU, compute capability 8.6
Nov 29 09:22:03 FRLFK0635009890 ollama[6284]: {"timestamp":1701246123,"level":"INFO","function":"main","line":2534,"message":"build info","build":375,"commit":"9656026"}
Nov 29 09:22:03 FRLFK0635009890 ollama[6284]: {"timestamp":1701246123,"level":"INFO","function":"main","line":2537,"message":"system info","n_threads":12,"n_threads_batch":-1,"total_threads":24,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}

So my issue is specific to the official Docker image.

Here is how I use it:

  ollama:
    image: ollama/ollama
    environment:
      - OLLAMA_ORIGINS=*
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "11434:11434"
    volumes:
      - ./ollama:/root/.ollama
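
For anyone not using Compose, an equivalent docker run invocation would be roughly as follows (a sketch mirroring the compose file above; note that -v wants an absolute host path):

$ docker run -d --gpus all \
    -e OLLAMA_ORIGINS='*' \
    -e OLLAMA_HOST=0.0.0.0:11434 \
    -p 11434:11434 \
    -v "$PWD/ollama:/root/.ollama" \
    ollama/ollama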

@fxrobin commented on GitHub (Nov 29, 2023):

OK, I found a workaround: creating my own Dockerfile (and image) with this:

FROM nvcr.io/nvidia/cuda:11.6.1-devel-ubuntu20.04

RUN apt-get update && apt-get install -y ca-certificates curl

RUN curl https://ollama.ai/install.sh | sh

EXPOSE 11434
ENV OLLAMA_HOST 0.0.0.0
ENTRYPOINT ["/usr/local/bin/ollama"]
CMD ["serve"]
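
To use it, build the image and run it with the GPU flag (a sketch; ollama-cuda is just an example tag):

$ docker build -t ollama-cuda .
$ docker run -d --gpus all -p 11434:11434 -v "$PWD/ollama:/root/.ollama" ollama-cuda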

Now it's working like a charm in Docker. No errors. GPU is used.

2023/11/29 11:10:54 llama.go:292: 3641 MB VRAM available, loading up to 21 GPU layers
2023/11/29 11:10:54 llama.go:421: starting llama runner
2023/11/29 11:10:54 llama.go:479: waiting for llama runner to start responding
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA RTX A1000 Laptop GPU, compute capability 8.6
{"timestamp":1701256256,"level":"INFO","function":"main","line":2534,"message":"build info","build":375,"commit":"9656026"}
{"timestamp":1701256256,"level":"INFO","function":"main","line":2537,"message":"system info","n_threads":12,"n_threads_batch":-1,"total_threads":24,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256:6ae28029995007a3ee8d0b8556d50f3b59b831074cf19c84de87acf51fb54054 (version GGUF V2)

@djmaze commented on GitHub (Dec 14, 2023):

Same here. The problem is that the final docker image does not contain any CUDA libraries.

Changing line 17 in the Dockerfile (https://github.com/jmorganca/ollama/blob/6e16098a60ae3834cd5f547d7e26f9e800c589c7/Dockerfile#L17C6-L17C18) to FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04 should fix this. (Tried this with a derived image successfully.)


@djmaze commented on GitHub (Dec 20, 2023):

I created a PR for this in #1644. See if it helps you as well.


@pdevine commented on GitHub (Jan 25, 2024):

@fxrobin are you still seeing this issue in 0.1.20?


@dhiltgen commented on GitHub (Mar 12, 2024):

We had a bug a while back where we were not setting the correct environment variables on our container image, which sometimes resulted in the NVIDIA container runtime not mounting the libraries and not passing the GPU through into the container as it is supposed to. This should be fixed now. If you're still facing any problems with the latest release, let us know.
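
For reference, the variables the NVIDIA container runtime keys off are NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES (an assumption that these are the ones meant here; they are the standard ones). On an affected image they can also be forced per container as a workaround:

$ docker run -d --gpus all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    -p 11434:11434 ollama/ollama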

Reference: github-starred/ollama#62714