[GH-ISSUE #13281] illegal instruction & GPU discovery timeout on Windows 10/11 with Quadro RTX 5000 (CUDA 12.8) – Ollama 0.13.0 #34536

Open
opened 2026-04-22 18:11:44 -05:00 by GiteaMirror · 2 comments

Originally created by @watsonts on GitHub (Dec 1, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13281

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

<html>

I’m encountering two recurring issues when running Ollama 0.13.0 on a Windows machine (Quadro RTX 5000, CUDA 12.8, driver 573.57). Below is a concise summary of the symptoms, environment, and a reproduction plan. Any guidance or a fix would be greatly appreciated.


1. Environment

Item | Value
-- | --
GPU | Quadro RTX 5000 (Compute Capability 7.5, 16 GiB)
Driver | NVIDIA‑573.57 (CUDA 12.8)
CUDA Toolkit | 12.8.61 (nvcc)
Ollama | 0.13.0 (stand‑alone Windows binary)
OS | Windows 11 Pro (21H2)
Other | No other GPU‑intensive processes running

5. What I Suspect

The binary packaged in the 0.13.0 release appears to be built against CUDA 12.5/12.6 (or earlier). When executed against a CUDA 12.8 runtime, the PTX compiled for newer GPU compute capabilities triggers an illegal instruction in ggml_cuda_compute_forward. The Flash Attention path seems to exacerbate this, as it uses cuBLAS/cuDNN routines that are only available in newer driver versions.

6. Suggested Fix / Request

  1. Rebuild the Windows binary for 0.13.x using the latest CUDA 12.8 toolkit (or at least provide a binary built for 12.8).
  2. Alternatively, ship a CPU‑only fallback that is automatically used when the GPU runtime fails.
  3. If Flash Attention is required, ensure that the binary is compiled with the correct PTX for compute capability 7.5 (RTX 5000) and that the driver supports the required cuBLAS/cuDNN API.
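Until a rebuilt binary is available, points 2 and 3 can be roughly approximated from the user side with environment variables. The Python sketch below builds the environment for an `ollama serve` process. It assumes that an invalid `CUDA_VISIBLE_DEVICES` value such as `-1` hides the NVIDIA GPU so that Ollama falls back to the CPU backend, and that `OLLAMA_FLASH_ATTENTION=0` and `OLLAMA_DEBUG=1` behave in this release as described in the Ollama documentation. The helper names `diagnostic_env` and `start_server` are hypothetical, and the launch call is left commented out.

```python
import os
import subprocess


def diagnostic_env(base=None, force_cpu=False, flash_attention=False):
    """Return a copy of the environment tuned for isolating the CUDA failure."""
    env = dict(os.environ if base is None else base)
    # Assumption: OLLAMA_DEBUG=1 enables verbose server logging.
    env["OLLAMA_DEBUG"] = "1"
    # Assumption: "0" disables the Flash Attention path, "1" enables it.
    env["OLLAMA_FLASH_ATTENTION"] = "1" if flash_attention else "0"
    if force_cpu:
        # Assumption: an invalid GPU ID hides all CUDA devices from the runtime.
        env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


def start_server(env):
    """Start `ollama serve` with the given environment (needs ollama on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=env)


if __name__ == "__main__":
    env = diagnostic_env(force_cpu=True)
    shown = ("OLLAMA_DEBUG", "OLLAMA_FLASH_ATTENTION", "CUDA_VISIBLE_DEVICES")
    print({key: env[key] for key in shown})
    # start_server(env)  # uncomment to actually launch the server
```

If the model loads with `force_cpu=True` but fails with the GPU visible and Flash Attention disabled, that would suggest the illegal instruction is not specific to the Flash Attention path.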

7. Sample Log Snippet

time=2025-12-01T08:27:04.257+08:00 level=INFO source=ggml.go:136 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=459 num_key_values=32
load_backend: loaded CPU backend from C:\Users\admin\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Quadro RTX 5000, compute capability 7.5, VMM: yes, ID: GPU-c380d34f-d337-c74a-14c9-9f5d7dc17349
load_backend: loaded CUDA backend from C:\Users\admin\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
...
ggml_cuda_compute_forward: SWIGLU_OAI failed
CUDA error: an illegal instruction was encountered
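For longer server logs, the relevant facts can be pulled out mechanically. The following Python sketch scans log text for the detected device, the failing ggml operation, and the CUDA error message. The regular expressions are assumptions based only on the line shapes in the snippet above, not on a documented log format, and `summarize` is a hypothetical helper name.

```python
import re

# Patterns inferred from the log lines in this report (an assumption).
DEVICE_RE = re.compile(r"Device (\d+): (.+?), compute capability (\d+\.\d+)")
FAILED_OP_RE = re.compile(r"ggml_cuda_compute_forward: (\w+) failed")
CUDA_ERROR_RE = re.compile(r"CUDA error: (.+)")


def summarize(log_text):
    """Collect detected devices, the failing op, and the CUDA error message."""
    summary = {"devices": [], "failed_op": None, "cuda_error": None}
    for line in log_text.splitlines():
        line = line.strip()
        if m := DEVICE_RE.search(line):
            # (index, name, compute capability)
            summary["devices"].append((int(m.group(1)), m.group(2), m.group(3)))
        elif m := FAILED_OP_RE.search(line):
            summary["failed_op"] = m.group(1)
        elif m := CUDA_ERROR_RE.search(line):
            summary["cuda_error"] = m.group(1)
    return summary


# Lines copied from the snippet above.
SAMPLE = """\
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Quadro RTX 5000, compute capability 7.5, VMM: yes, ID: GPU-c380d34f-d337-c74a-14c9-9f5d7dc17349
ggml_cuda_compute_forward: SWIGLU_OAI failed
CUDA error: an illegal instruction was encountered
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```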

8. Next Steps

  • If you can provide a newer 0.13.x release built against CUDA 12.8, I can retest on my system.
  • Alternatively, if there’s an experimental flag to force the older PTX, please let me know.

Thank you for your hard work on Ollama! Looking forward to your guidance.

</html>

Relevant log output


OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.13.0

GiteaMirror added the nvidia, bug, and windows labels 2026-04-22 18:11:44 -05:00

@watsonts commented on GitHub (Dec 1, 2025):

server-1.log (https://github.com/user-attachments/files/23843380/server-1.log)

@dhiltgen commented on GitHub (Dec 2, 2025):

We currently build Windows linked against CUDA v12.8.0 and v13.0.0.

I tried running gpt-oss:20b on a very similar setup and it seems to work correctly. What model were you loading?

time=2025-12-02T14:40:19.255-08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-87228d0f-4cdc-31b2-6fae-658d3c36bd2f filter_id="" library=CUDA compute=7.5 name=CUDA0 description="Quadro RTX 5000" libdirs=ollama,cuda_v12 driver=12.8 pci_id=0000:01:00.0 type=discrete total="16.0 GiB" available="15.5 GiB"
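To compare this line with the matching "inference compute" line from another machine, the key=value pairs can be split into a dictionary. A minimal Python sketch, assuming the line follows the logfmt-style quoting seen above; `parse_logfmt` is a hypothetical helper, not part of Ollama.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    # shlex treats backslashes as escapes, so lines that contain
    # Windows paths would need different handling.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# The log line quoted above, split across string literals for readability.
LINE = (
    'time=2025-12-02T14:40:19.255-08:00 level=INFO source=types.go:42 '
    'msg="inference compute" id=GPU-87228d0f-4cdc-31b2-6fae-658d3c36bd2f '
    'filter_id="" library=CUDA compute=7.5 name=CUDA0 '
    'description="Quadro RTX 5000" libdirs=ollama,cuda_v12 driver=12.8 '
    'pci_id=0000:01:00.0 type=discrete total="16.0 GiB" available="15.5 GiB"'
)

if __name__ == "__main__":
    fields = parse_logfmt(LINE)
    for key in ("description", "compute", "driver", "libdirs", "available"):
        print(f"{key}: {fields[key]}")
```

Differences in `compute`, `driver`, or `libdirs` between the two machines would be the first things to check.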
Reference: github-starred/ollama#34536