[GH-ISSUE #15249] gemma4:31b-it-q4_K_M fails to load with CUDA error: cublasGemmBatchedEx internal operation failed (v0.20.0, RTX 4090) #9754

Closed
opened 2026-04-12 22:38:48 -05:00 by GiteaMirror · 2 comments

Originally created by @RAFOLIE on GitHub (Apr 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15249

Originally assigned to: @dhiltgen on GitHub.

Description

When trying to run gemma4:31b-it-q4_K_M, the model fails to load with a 500 error. The server log shows a CUDA internal error during the vision encoder initialization phase.

Environment

  • Ollama version: 0.20.0
  • OS: Windows 11
  • GPU: NVIDIA GeForce RTX 4090 (24 GiB VRAM, ~20.5 GiB available)
  • CUDA Driver version: 595.79 (CUDA 13.2)
  • RAM: 64 GiB (~49 GiB free)
  • Model: gemma4:31b-it-q4_K_M (19 GB, Q4_K_M)

Steps to Reproduce

  1. Pull gemma4:31b-it-q4_K_M
  2. Send a chat request (text only, no image); see the sketch after this list
  3. Model fails to load with HTTP 500
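
A minimal reproduction sketch for step 2, assuming the default Ollama endpoint at http://localhost:11434 and the documented /api/chat request shape; the prompt text and timeout are illustrative only:

  # Text-only chat request that triggers the model load (and the 500 response
  # once the runner crashes). Requires the third-party "requests" package.
  import requests

  resp = requests.post(
      "http://localhost:11434/api/chat",
      json={
          "model": "gemma4:31b-it-q4_K_M",
          "messages": [{"role": "user", "content": "Hello"}],
          "stream": False,
      },
      timeout=300,
  )
  print(resp.status_code)  # observed: 500
  print(resp.text)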

Error from server.log

time=2026-04-03T09:26:08.106+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=99.5538ms shape="[5376 256]"
CUDA error: an internal operation failed
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at ggml-cuda.cu:2130
  cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), cu_data_type_a, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), cu_data_type_b, s11, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne0, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error

time=2026-04-03T09:26:30.753+08:00 level=ERROR source=server.go:1207 msg="do load request" error="Post http://127.0.0.1:60141/load: wsarecv: An existing connection was forcibly closed by the remote host."
time=2026-04-03T09:26:30.753+08:00 level=INFO source=sched.go:511 msg="Load failed" error="model failed to load, this may be due to resource limitations or an internal error"
time=2026-04-03T09:26:31.036+08:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"

Notes

  • The error consistently occurs right after the vision: encoded log line, during the vision projector's batched matrix multiplication (cublasGemmBatchedEx).
  • Resources are sufficient: 20+ GiB free VRAM, 49 GiB free RAM.
  • The same error occurs on every repeated attempt.
  • OLLAMA_FLASH_ATTENTION is currently disabled (false).
  • OLLAMA_CONTEXT_LENGTH is set to 262144.
  • OLLAMA_NUM_PARALLEL is set to 4.

@dhiltgen commented on GitHub (Apr 3, 2026):

(The following paragraph was struck through in the original comment.) Most likely this is an OOM corner case we're not handling properly. Until we can fix it to back-off, you should be able to work around it by reducing your num parallel or context size.

There's a corruption bug on the memory prediction path - we'll get a patch release out soon to fix this.


@dhiltgen commented on GitHub (Apr 3, 2026):

It looks like the crash is related to OLLAMA_NUM_PARALLEL larger than 1 - as a short-term workaround until we get a patch release out, set OLLAMA_NUM_PARALLEL=1 and you should avoid the crash.
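
A hedged sketch of that workaround, assuming the server is launched manually rather than as a Windows service (in which case OLLAMA_NUM_PARALLEL would instead be set in the system environment before restarting Ollama):

  # Start "ollama serve" with OLLAMA_NUM_PARALLEL=1 to avoid the crash until a
  # patch release is available. Launching via subprocess is illustrative only.
  import os
  import subprocess

  env = dict(os.environ, OLLAMA_NUM_PARALLEL="1")
  subprocess.Popen(["ollama", "serve"], env=env)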
