[GH-ISSUE #13614] CUDA tensor upload failures during model load #55471

Closed
opened 2026-04-29 09:16:29 -05:00 by GiteaMirror · 6 comments

Originally created by @nabbi on GitHub (Jan 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13614

What is the issue?

CUDA failures are observed during embedding text model load in the ggml CUDA backend, typically late in the load process. Failures manifest as cudaMemcpyAsync() returning invalid argument, often preceded by cuMemGetAddressRange() returning CUDA_ERROR_NOT_FOUND for the destination pointer.

`ollama run nomic-embed-text "test"`

Note that I reviewed upstream [ggml-cuda.cu](https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cuda/ggml-cuda.cu) and it does not have this cudaDeviceReset() call. It looks like Ollama is pulling it in from a local patch, [0022-ggml-Enable-resetting-backend-devices.patch](https://github.com/ollama/ollama/blob/main/llama/patches/0022-ggml-Enable-resetting-backend-devices.patch).
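For illustration, a minimal standalone sketch (my own repro outline, not Ollama code; it assumes a single GPU and the plain CUDA runtime) of why a stray cudaDeviceReset() produces exactly this failure mode: the reset destroys the primary context, so any device pointer allocated before it is invalid afterwards and the next copy fails with invalid argument.

```cpp
// Minimal repro sketch (assumption: single GPU, plain CUDA runtime).
// Not Ollama code; it only demonstrates the failure mode described above.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    char host[4096] = {0};
    void *dev = nullptr;
    cudaMalloc(&dev, sizeof(host));   // allocate in the primary context

    cudaDeviceReset();                // destroys the primary context, as a
                                      // backend-level device reset would

    // 'dev' now dangles in the re-created context; the copy fails with
    // cudaErrorInvalidValue ("invalid argument").
    cudaError_t err = cudaMemcpyAsync(dev, host, sizeof(host),
                                      cudaMemcpyHostToDevice, nullptr);
    printf("cudaMemcpyAsync: %s\n", cudaGetErrorString(err));
    return 0;
}
```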

Context / related code changes

These are my local commits, referenced to provide context for the current code state and the diagnostics used below. I consider this more of a workaround; it needs a broader review by the project maintainers for a proper fix.

- [40f11ffe28](https://github.com/ollama/ollama/commit/40f11ffe28b7b94f39e4d4ed668c9d927cdd6e1a): disables cudaDeviceReset() during device reset handling.
- [3554b32cef](https://github.com/ollama/ollama/commit/3554b32ceff70200375766e1962b9aaf29b9987d): adds additional CUDA diagnostics to aid troubleshooting of tensor upload failures.

Observed behavior

During model load (typically around 90–95% progress), tensor uploads fail in ggml_backend_cuda_buffer_set_tensor().

Representative events observed in sequence (a simplified sketch of the upload path follows this list):

1. A tensor upload begins with valid-looking device pointers and offsets.
2. cuMemGetAddressRange() fails with CUDA_ERROR_NOT_FOUND for the destination pointer.
3. No out-of-bounds condition is detected for the allocation.
4. cudaMemcpyAsync(..., cudaMemcpyHostToDevice, ...) fails with invalid argument.
5. No prior CUDA runtime error is reported by cudaGetLastError().
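For reference, the failing path reduces to a plain host-to-device async copy. A simplified sketch follows (an illustrative paraphrase, not the exact ggml_backend_cuda_buffer_set_tensor code; the function name and stream choice are assumptions, though ((cudaStream_t)0x2) in the error logs later in this thread matches the CUDA headers' value for cudaStreamPerThread):

```cpp
// Simplified shape of the failing upload (illustrative paraphrase, not the
// exact ggml_backend_cuda_buffer_set_tensor implementation).
#include <cstdio>
#include <cuda_runtime.h>

// Copy 'size' bytes of host data into a device buffer at base + offset.
static void upload(void *dev_base, size_t offset,
                   const void *host, size_t size, int device) {
    cudaSetDevice(device);                       // switch to the buffer's device
    cudaError_t err = cudaMemcpyAsync((char *)dev_base + offset, host, size,
                                      cudaMemcpyHostToDevice,
                                      cudaStreamPerThread);
    if (err != cudaSuccess) {
        // This is where the logs report "CUDA error: invalid argument".
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return;
    }
    cudaStreamSynchronize(cudaStreamPerThread);  // uploads complete before return
}

int main() {
    char host[131072] = {0};
    void *dev = nullptr;
    cudaMalloc(&dev, sizeof(host));
    upload(dev, 0, host, sizeof(host), 0);
    cudaFree(dev);
    return 0;
}
```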

Diagnostics captured

With additional diagnostics enabled, the following information was captured immediately before the failing memcpy (a sketch of such a pre-copy check follows this list):

- current CUDA device vs. expected device
- result of cuCtxGetCurrent() on the calling thread
- pointer attributes for the base tensor pointer and the offset pointer
- allocation base, length, and bounds check
- latent CUDA runtime error state
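A sketch of how such a pre-copy check can be wired in just before the memcpy (hypothetical helper; the name diag_check_dst and the log format are illustrative, not the exact diagnostics commit). It assumes CUDA is already initialized in-process, as it is inside the backend:

```cpp
// Illustrative pre-copy validation (hypothetical helper, not the actual
// diagnostics patch). Links against both the runtime and driver APIs.
#include <cstdint>
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

// Returns false if the destination is not a live device allocation or the
// copy would run out of bounds.
static bool diag_check_dst(const void *dst, size_t size) {
    CUdeviceptr base = 0;
    size_t len = 0;
    CUresult cr = cuMemGetAddressRange(&base, &len,
                                       (CUdeviceptr)(uintptr_t)dst);
    if (cr != CUDA_SUCCESS) {
        const char *name = nullptr;
        cuGetErrorName(cr, &name);
        fprintf(stderr, "[GGML-CUDA-DIAG] cuMemGetAddressRange FAILED cr=%d %s p=%p\n",
                (int)cr, name ? name : "?", dst);
        return false;
    }
    size_t off = (size_t)((const char *)dst - (const char *)(uintptr_t)base);
    int oob = off + size > len;               // bounds check
    fprintf(stderr, "[GGML-CUDA-DIAG] alloc base=0x%llx len=%zu p_off=%zu size=%zu OOB=%d\n",
            (unsigned long long)base, len, off, size, oob);
    cudaError_t last = cudaGetLastError();    // latent runtime error state
    if (last != cudaSuccess)
        fprintf(stderr, "[GGML-CUDA-DIAG] latent error: %s\n",
                cudaGetErrorString(last));
    return !oob;
}
```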

Example diagnostic output (abridged):

```text
[GGML-CUDA-DIAG] cuCtxGetCurrent BAD cr=0 CUDA_SUCCESS ctx=(nil)
[GGML-CUDA-DIAG] dev(ctx)=0 dev(cur)=0 tensor=0x... dst1=0x...
[GGML-CUDA-DIAG] cuMemGetAddressRange FAILED cr=500 CUDA_ERROR_NOT_FOUND p=0x...
CUDA error: invalid argument
```

Expected behavior

Tensor uploads during model load should not fail when using valid device pointers within allocation bounds.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.13.5 / main

GiteaMirror added the bug label 2026-04-29 09:16:29 -05:00

@rick-github commented on GitHub (Jan 3, 2026):

Post the unabridged server log.


@nabbi commented on GitHub (Jan 4, 2026):

Allow me time to reproduce the full log again.
These are the sections I captured.

```text
=2026-01-02T11:08:47.168-06:00 level=INFO source=device.go:272 msg="total memory" size="567.6 MiB"
time=2026-01-02T11:08:47.168-06:00 level=INFO source=sched.go:517 msg="loaded runners" count=1
time=2026-01-02T11:08:47.168-06:00 level=INFO source=server.go:1338 msg="waiting for llama runner to start responding"
time=2026-01-02T11:08:47.169-06:00 level=INFO source=server.go:1372 msg="waiting for server to become available" status="llm server loading model"
time=2026-01-02T11:08:47.169-06:00 level=DEBUG source=server.go:1382 msg="model load progress 0.00"
time=2026-01-02T11:08:47.420-06:00 level=DEBUG source=server.go:1382 msg="model load progress 0.93"
CUDA error: invalid argument
  current device: 0, in function ggml_backend_cuda_buffer_set_tensor at /var/tmp/portage/sci-ml/ollama-0.13.5-r2/work/ollama-0.13.5/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:752
  cudaMemcpyAsyncReserve((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, ((cudaStream_t)0x2))
/var/tmp/portage/sci-ml/ollama-0.13.5-r2/work/ollama-0.13.5/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
CUDA error: invalid argument
CUDA error: invalid argument
  current device: 0, in function ggml_backend_cuda_buffer_set_tensor at /var/tmp/portage/sci-ml/ollama-0.13.5-r2/work/ollama-0.13.5/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:752
  current device: 0, in function ggml_backend_cuda_buffer_set_tensor at /var/tmp/portage/sci-ml/ollama-0.13.5-r2/work/ollama-0.13.5/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:752
  cudaMemcpyAsyncReserve((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, ((cudaStream_t)0x2))
  cudaMemcpyAsyncReserve((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, ((cudaStream_t)0x2))
/var/tmp/portage/sci-ml/ollama-0.13.5-r2/work/ollama-0.13.5/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
/var/tmp/portage/sci-ml/ollama-0.13.5-r2/work/ollama-0.13.5/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
[New LWP 4710]
[New LWP 4709]
```

Sometimes the client side sees additional output after the error, but usually it is just that first error line.

```text
$ ollama run nomic-embed-text test
Error: llama runner process has terminated: CUDA error: invalid argument
  current device: 0, in function ggml_backend_cuda_buffer_set_tensor at /var/tmp/portage/sci-ml/ollama-0.13.5-r2/work/ollama-0.13.5/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:752
  cudaMemcpyAsyncReserve((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, ((cudaStream_t)0x2))
/var/tmp/portage/sci-ml/ollama-0.13.5-r2/work/ollama-0.13.5/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
```

Added diagnostic outputs, logged just before the failure at around the mid-90% load mark:

```text
[GGML-CUDA-DIAG] dev(ctx)=0 dev(cur)=0 tensor=0x28260100 buffer=0x2971d3a0 data(host)=0xc0007ea000 dst0=0x7ff0af756c00 dst1=0x7ff0afa36c00 off=3014656 size=131072 attr0=0(no error) type0=2 attr1=0(no error) type1=2
[GGML-CUDA-DIAG] alloc base=0x7ff0a0000000 len=273521664 p=0x7ff0afa36c00 p_off=262368256 size=131072 end_off=262499328 OOB=0

[GGML-CUDA-DIAG] dev(ctx)=0 dev(cur)=0 tensor=0x282603e0 buffer=0x2971d3a0 data(host)=0xc00082a000 dst0=0x7ff0b0056c00 dst1=0x7ff0b0296c00 off=2359296 size=131072 attr0=0(no error) type0=2 attr1=0(no error) type1=2
[GGML-CUDA-DIAG] alloc base=0x7ff0a0000000 len=273521664 p=0x7ff0b0296c00 p_off=271150080 size=131072 end_off=271281152 OOB=0

[GGML-CUDA-DIAG] dev(ctx)=0 dev(cur)=0 tensor=0x28260270 buffer=0x2971d3a0 data(host)=0xc00080a000 dst0=0x7ff0afbd6c00 dst1=0x7ff0affd6c00 off=4194304 size=131072 attr0=0(no error) type0=2 attr1=0(no error) type1=2
[GGML-CUDA-DIAG] alloc base=0x7ff0a0000000 len=273521664 p=0x7ff0affd6c00 p_off=268266496 size=131072 end_off=268397568 OOB=0

[GGML-CUDA-DIAG] dev(ctx)=0 dev(cur)=0 tensor=0x282603e0 buffer=0x2971d3a0 data(host)=0xc00082a000 dst0=0x7ff0b0056c00 dst1=0x7ff0b02b6c00 off=2490368 size=131072 attr0=0(no error) type0=2 attr1=0(no error) type1=2
[GGML-CUDA-DIAG] alloc base=0x7ff0a0000000 len=273521664 p=0x7ff0b02b6c00 p_off=271281152 size=131072 end_off=271412224 OOB=0

[GGML-CUDA-DIAG] dev(ctx)=0 dev(cur)=0 tensor=0x28260100 buffer=0x2971d3a0 data(host)=0xc0007ea000 dst0=0x7ff0af756c00 dst1=0x7ff0afa56c00 off=3145728 size=131072 attr0=0(no error) type0=2 attr1=0(no error) type1=2
[GGML-CUDA-DIAG] cuMemGetAddressRange FAILED cr=709 CUDA_ERROR_CONTEXT_IS_DESTROYED context is destroyed p=0x7ff0afa56c00

[GGML-CUDA-DIAG] dev(ctx)=0 dev(cur)=0 tensor=0x28256e60 buffer=0x2971d3a0 data(host)=0xc0007b2000 dst0=0x7ff0a0000000 dst1=0x7ff0a2120000 off=34734080 size=131072 attr0=0(no error) type0=2 attr1=0(no error) type1=2
[GGML-CUDA-DIAG] cuMemGetAddressRange FAILED cr=709 CUDA_ERROR_CONTEXT_IS_DESTROYED context is destroyed p=0x7ff0a2120000

[GGML-CUDA-DIAG] dev(ctx)=0 dev(cur)=0 tensor=0x282603e0 buffer=0x2971d3a0 data(host)=0xc00082a000 dst0=0x7ff0b0056c00 dst1=0x7ff0b02d6c00 off=2621440 size=131072 attr0=0(no error) type0=2 attr1=0(no error) type1=2
[GGML-CUDA-DIAG] cuMemGetAddressRange FAILED cr=709 CUDA_ERROR_CONTEXT_IS_DESTROYED context is destroyed p=0x7ff0b02d6c00

[GGML-CUDA-DIAG] dev(ctx)=0 dev(cur)=0 tensor=0x28260270 buffer=0x2971d3a0 data(host)=0xc00080a000 dst0=0x7ff0afbd6c00 dst1=0x7ff0afff6c00 off=4325376 size=131072 attr0=0(no error) type0=2 attr1=0(no error) type1=2
[GGML-CUDA-DIAG] cuMemGetAddressRange FAILED cr=709 CUDA_ERROR_CONTEXT_IS_DESTROYED context is destroyed p=0x7ff0afff6c00
```

@nabbi commented on GitHub (Jan 4, 2026):

I've uploaded the full logs to [nabbi/tshoot-ollama-13614](https://github.com/nabbi/tshoot-ollama-13614).

[serve-error-diag.log](https://raw.githubusercontent.com/nabbi/tshoot-ollama-13614/refs/heads/main/logs-cuda-error/serve-error-diag.log) shows the CUDA context dying mid-load (CUDA_ERROR_CONTEXT_IS_DESTROYED), causing subsequent memcpy calls to fail with "invalid argument" and ggml to abort.

Thank you


@rick-github commented on GitHub (Jan 5, 2026):

The runner is loading multiple copies of the backends:

```text
time=2026-01-04T07:20:38.528-06:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib64/ollama
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, ID: GPU-81cc0f98-96b6-0d33-bfa1-5989f1d84393
load_backend: loaded CUDA backend from /usr/lib64/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib64/ollama/libggml-cpu-x64.so
time=2026-01-04T07:20:38.728-06:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib64/ollama/backends
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, ID: GPU-81cc0f98-96b6-0d33-bfa1-5989f1d84393
load_backend: loaded CUDA backend from /usr/lib64/ollama/backends/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib64/ollama/backends/libggml-cpu-x64.so
```
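One plausible mechanism for why the duplicate load matters (an assumption, not confirmed in this thread): the two dlopen()ed copies of libggml-cuda.so are independent library instances with their own state, yet both sit on the GPU's single primary context, so a device reset issued through one copy can destroy the context the other copy's allocations live in. A minimal sketch of the double-load itself:

```cpp
// Sketch only: demonstrates that loading the same backend from two paths
// yields two independent instances. Paths match the log above; build with
// -ldl. This does not touch CUDA itself.
#include <dlfcn.h>
#include <cstdio>

int main() {
    void *a = dlopen("/usr/lib64/ollama/libggml-cuda.so", RTLD_NOW | RTLD_LOCAL);
    void *b = dlopen("/usr/lib64/ollama/backends/libggml-cuda.so", RTLD_NOW | RTLD_LOCAL);
    // Two separate on-disk copies (different inodes), so the loader treats
    // them as different libraries: distinct handles, two initializations.
    printf("copy1=%p copy2=%p distinct=%d\n", a, b, (int)(a && b && a != b));
    return 0;
}
```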

I suggest using the [recommended install method](https://ollama.com/download) and verifying whether the problem persists.


@nabbi commented on GitHub (Jan 5, 2026):

Thanks for the insights. I'll explore fixing Gentoo's community ebuild packaging, since your install script targets binary distros running systemd.

Located the suspected configuration issue.

```shell
# backends end up in /usr/bin otherwise
-DGGML_BACKEND_DL="yes"
-DGGML_BACKEND_DIR="${EPREFIX}/usr/$(get_libdir)/${PN}/backends"
```

Confirmed the package is installing the backends twice:

```text
/usr/lib64/ollama/
├── backends
│   ├── libggml-cpu-x64.so
│   └── libggml-cuda.so
├── libggml-base.so -> libggml-base.so.0
├── libggml-base.so.0 -> libggml-base.so.0.0.0
├── libggml-base.so.0.0.0
├── libggml-cpu-x64.so
└── libggml-cuda.so
```


@nabbi commented on GitHub (Jan 5, 2026):

@rick-github Thanks again

Opened https://github.com/gentoo/guru/pull/409 against Gentoo GURU to fix this. The fix is also published in my [oubliette-overlay](https://github.com/nabbi/oubliette-overlay/tree/master/sci-ml/ollama).
