[GH-ISSUE #11229] gemma3 do not run #53909

Closed
opened 2026-04-29 04:56:42 -05:00 by GiteaMirror · 25 comments
Owner

Originally created by @zerowxt on GitHub (Jun 28, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11229

What is the issue?

Error Msg

I pulled gemma3n:e2b and gemma3:latest; both fail with the error llama runner process has terminated: error:fault.

System Info

  • Windows 11 26100.4484
  • Ollama 0.9.3

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-29 04:56:42 -05:00
Author
Owner

@CRCODE22 commented on GitHub (Jun 28, 2025):

I have the same error after Ollama notified me of an update and updated itself. I posted log information here:

https://github.com/ollama/ollama/issues/11211#issuecomment-3014386830

Author
Owner

@rick-github commented on GitHub (Jun 28, 2025):

@zerowxt [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

Author
Owner

@zerowxt commented on GitHub (Jun 28, 2025):

[server.log](https://github.com/user-attachments/files/20960440/server.log)

Author
Owner

@vgrechin commented on GitHub (Jun 28, 2025):

I also cannot run gemma3:12b since the latest update on Thursday, on Windows. Other models are fine. Version after the update:
ollama version is 0.9.3

Author
Owner

@Amigos-Flipado commented on GitHub (Jun 28, 2025):

Same here.

Author
Owner

@rick-github commented on GitHub (Jun 28, 2025):

@zerowxt

load_backend: loaded CUDA backend from D:\programmer\ai\ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CUDA backend from D:\programmer\ai\ollama\lib\ollama\cuda_v12\ggml-cuda.dll

Your runner is loading the CUDA backend twice from different locations, so this is probably #11211. See [here](https://github.com/ollama/ollama/issues/11211#issuecomment-3014572069) for details.

Author
Owner

@sol8712 commented on GitHub (Jun 29, 2025):

Here are the logs from my Nvidia machine hitting the error with the Gemma models. On my Raspberry Pi's CPU I can run both models with no problem.

Ignore the unusual paths to ollama; it's a manual rootless install.

time=2025-06-29T02:00:15.548-07:00 level=INFO source=server.go:135 msg="system memory" total="31.2 GiB" free="27.5 GiB" free_swap="0 B"
time=2025-06-29T02:00:15.549-07:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=48 layers.split="" memory.available="[10.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.2 GiB" memory.required.partial="8.6 GiB" memory.required.kv="736.0 MiB" memory.required.allocations="[8.6 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="519.5 MiB" memory.graph.partial="1.3 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-06-29T02:00:15.591-07:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/home/ollama/bin/ollama runner --ollama-engine --model /mnt/HDD/AI/ollama/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 4096 --batch-size 512 --n-gpu-layers 48 --threads 6 --parallel 1 --port 38619"
time=2025-06-29T02:00:15.591-07:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-29T02:00:15.591-07:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-06-29T02:00:15.591-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-06-29T02:00:15.597-07:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-29T02:00:15.598-07:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:38619"
time=2025-06-29T02:00:15.638-07:00 level=INFO source=ggml.go:92 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1065 num_key_values=37
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /home/ollama/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /home/ollama/lib/ollama/libggml-cpu-alderlake.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /home/ollama/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-06-29T02:00:15.679-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
unexpected fault address 0x2a9a50000
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x2 addr=0x2a9a50000 pc=0x561d4224b780]

goroutine 6 gp=0xc000187c00 m=10 mp=0xc0003ac008 [running]:
runtime.throw({0x561d43164479?, 0xc000187c00?})
        runtime/panic.go:1096 +0x4a fp=0xc000055110 sp=0xc0000550e0 pc=0x561d422bbc2a
runtime.sigpanic()
        runtime/signal_unix.go:939 +0x26c fp=0xc000055170 sp=0xc000055110 pc=0x561d422be0ac
indexbytebody()
        internal/bytealg/indexbyte_amd64.s:131 +0xe0 fp=0xc000055178 sp=0xc000055170 pc=0x561d4224b780
runtime.findnull(0xc0000551f8?)
        runtime/string.go:577 +0x79 fp=0xc0000551d0 sp=0xc000055178 pc=0x561d422a38b9
runtime.gostring(0x2a9a50000)
        runtime/string.go:363 +0x1c fp=0xc000055208 sp=0xc0000551d0 pc=0x561d422bef1c
github.com/ollama/ollama/ml/backend/ggml._Cfunc_GoString(...)
        _cgo_gotypes.go:311
github.com/ollama/ollama/ml/backend/ggml.New({0x7ffe1dde6db2, 0x60}, {0x6, 0x0, 0x30, {0x0, 0x0, 0x0}, 0x0})
        github.com/ollama/ollama/ml/backend/ggml/ggml.go:158 +0x1336 fp=0xc000055c18 sp=0xc000055208 pc=0x561d426f8376
github.com/ollama/ollama/ml.NewBackend({0x7ffe1dde6db2, 0x60}, {0x6, 0x0, 0x30, {0x0, 0x0, 0x0}, 0x0})
        github.com/ollama/ollama/ml/backend.go:209 +0xb1 fp=0xc000055c70 sp=0xc000055c18 pc=0x561d426e9e11
github.com/ollama/ollama/model.New({0x7ffe1dde6db2?, 0x0?}, {0x6, 0x0, 0x30, {0x0, 0x0, 0x0}, 0x0})
        github.com/ollama/ollama/model/model.go:102 +0x8f fp=0xc000055d68 sp=0xc000055c70 pc=0x561d42707a2f
github.com/ollama/ollama/runner/ollamarunner.(*Server).initModel(0xc000034900, {0x7ffe1dde6db2?, 0x0?}, {0x6, 0x0, 0x30, {0x0, 0x0, 0x0}, 0x0}, ...)
        github.com/ollama/ollama/runner/ollamarunner/runner.go:841 +0x8d fp=0xc000055dc8 sp=0xc000055d68 pc=0x561d427a9b6d
github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc000034900, {0x561d4361c880, 0xc0000f4460}, {0x7ffe1dde6db2?, 0x0?}, {0x6, 0x0, 0x30, {0x0, 0x0, ...}, ...}, ...)
        github.com/ollama/ollama/runner/ollamarunner/runner.go:878 +0xb8 fp=0xc000055f20 sp=0xc000055dc8 pc=0x561d427a9ed8
github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap1()
        github.com/ollama/ollama/runner/ollamarunner/runner.go:959 +0xc7 fp=0xc000055fe0 sp=0xc000055f20 pc=0x561d427ab307
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000055fe8 sp=0xc000055fe0 pc=0x561d422c3481
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
        github.com/ollama/ollama/runner/ollamarunner/runner.go:959 +0xa11

goroutine 1 gp=0xc000002380 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000605650 sp=0xc000605630 pc=0x561d422bbd4e
runtime.netpollblock(0x561d422b9ad3?, 0x42254b46?, 0x1d?)
        runtime/netpoll.go:575 +0xf7 fp=0xc000605688 sp=0xc000605650 pc=0x561d42280837
internal/poll.runtime_pollWait(0x7fdf70020de0, 0x72)
        runtime/netpoll.go:351 +0x85 fp=0xc0006056a8 sp=0xc000605688 pc=0x561d422baf65
internal/poll.(*pollDesc).wait(0xc00001a500?, 0x380016?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0006056d0 sp=0xc0006056a8 pc=0x561d423423a7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc00001a500)
        internal/poll/fd_unix.go:620 +0x295 fp=0xc000605778 sp=0xc0006056d0 pc=0x561d42347775
net.(*netFD).accept(0xc00001a500)
        net/fd_unix.go:172 +0x29 fp=0xc000605830 sp=0xc000605778 pc=0x561d423b9c89
net.(*TCPListener).accept(0xc000379240)
        net/tcpsock_posix.go:159 +0x1b fp=0xc000605880 sp=0xc000605830 pc=0x561d423cf63b
net.(*TCPListener).Accept(0xc000379240)
        net/tcpsock.go:380 +0x30 fp=0xc0006058b0 sp=0xc000605880 pc=0x561d423ce4f0
net/http.(*onceCloseListener).Accept(0x561d4361c810?)
        <autogenerated>:1 +0x24 fp=0xc0006058c8 sp=0xc0006058b0 pc=0x561d425e5c44
net/http.(*Server).Serve(0xc00026f600, {0x561d4361a408, 0xc000379240})
        net/http/server.go:3424 +0x30c fp=0xc0006059f8 sp=0xc0006058c8 pc=0x561d425bd50c
github.com/ollama/ollama/runner/ollamarunner.Execute({0xc0001ac030, 0xe, 0xf})
        github.com/ollama/ollama/runner/ollamarunner/runner.go:984 +0xe09 fp=0xc000605d08 sp=0xc0006059f8 pc=0x561d427aaf69
github.com/ollama/ollama/runner.Execute({0xc0001ac010?, 0x0?, 0x0?})
        github.com/ollama/ollama/runner/runner.go:20 +0xc9 fp=0xc000605d30 sp=0xc000605d08 pc=0x561d427ab869
github.com/ollama/ollama/cmd.NewCLI.func2(0xc00026f400?, {0x561d43163075?, 0x4?, 0x561d43163079?})
        github.com/ollama/ollama/cmd/cmd.go:1529 +0x45 fp=0xc000605d58 sp=0xc000605d30 pc=0x561d42f08645
github.com/spf13/cobra.(*Command).execute(0xc000020f08, {0xc0000b8ff0, 0xf, 0xf})
        github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc000605e78 sp=0xc000605d58 pc=0x561d424332dc
github.com/spf13/cobra.(*Command).ExecuteC(0xc0000e0908)
        github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc000605f30 sp=0xc000605e78 pc=0x561d42433b25
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
        github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
        github.com/ollama/ollama/main.go:12 +0x4d fp=0xc000605f50 sp=0xc000605f30 pc=0x561d42f090cd
runtime.main()
        runtime/proc.go:283 +0x29d fp=0xc000605fe0 sp=0xc000605f50 pc=0x561d42287ebd
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000605fe8 sp=0xc000605fe0 pc=0x561d422c3481

goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x561d422bbd4e
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.forcegchelper()
        runtime/proc.go:348 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x561d422881f8
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x561d422c3481
created by runtime.init.7 in goroutine 1
        runtime/proc.go:336 +0x1a

goroutine 3 gp=0xc000003340 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x561d422bbd4e
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.bgsweep(0xc000042080)
        runtime/mgcsweep.go:316 +0xdf fp=0xc0000857c8 sp=0xc000085780 pc=0x561d4227299f
runtime.gcenable.gowrap1()
        runtime/mgc.go:204 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x561d42266d85
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x561d422c3481
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x561d43322c38?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x561d422bbd4e
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.(*scavengerState).park(0x561d43eaa8e0)
        runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x561d422703e9
runtime.bgscavenge(0xc000042080)
        runtime/mgcscavenge.go:658 +0x59 fp=0xc000085fc8 sp=0xc000085fa8 pc=0x561d42270979
runtime.gcenable.gowrap2()
        runtime/mgc.go:205 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x561d42266d25
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x561d422c3481
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:205 +0xa5

goroutine 18 gp=0xc000186380 m=nil [finalizer wait]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc000084688?)
        runtime/proc.go:435 +0xce fp=0xc000084630 sp=0xc000084610 pc=0x561d422bbd4e
runtime.runfinq()
        runtime/mfinal.go:196 +0x107 fp=0xc0000847e0 sp=0xc000084630 pc=0x561d42265d47
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x561d422c3481
created by runtime.createfing in goroutine 1
        runtime/mfinal.go:166 +0x3d

goroutine 19 gp=0xc000186e00 m=nil [chan receive]:
runtime.gopark(0xc0002a17c0?, 0xc000598018?, 0x60?, 0x7?, 0x561d423a09c8?)
        runtime/proc.go:435 +0xce fp=0xc000080718 sp=0xc0000806f8 pc=0x561d422bbd4e
runtime.chanrecv(0xc000182310, 0x0, 0x1)
        runtime/chan.go:664 +0x445 fp=0xc000080790 sp=0xc000080718 pc=0x561d42257725
runtime.chanrecv1(0x0?, 0x0?)
        runtime/chan.go:506 +0x12 fp=0xc0000807b8 sp=0xc000080790 pc=0x561d422572b2
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
        runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
        runtime/mgc.go:1799 +0x2f fp=0xc0000807e0 sp=0xc0000807b8 pc=0x561d42269f2f
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x561d422c3481
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
        runtime/mgc.go:1794 +0x85

goroutine 20 gp=0xc000187180 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000080f38 sp=0xc000080f18 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc000080fc8 sp=0xc000080f38 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc000080fe0 sp=0xc000080fc8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 34 gp=0xc000102380 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000118738 sp=0xc000118718 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc0001187c8 sp=0xc000118738 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc0001187e0 sp=0xc0001187c8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0001187e8 sp=0xc0001187e0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 35 gp=0xc000102540 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000118f38 sp=0xc000118f18 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc000118fc8 sp=0xc000118f38 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc000118fe0 sp=0xc000118fc8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000118fe8 sp=0xc000118fe0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 36 gp=0xc000102700 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000119738 sp=0xc000119718 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc0001197c8 sp=0xc000119738 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc0001197e0 sp=0xc0001197c8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0001197e8 sp=0xc0001197e0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 37 gp=0xc0001028c0 m=nil [GC worker (idle)]:
runtime.gopark(0xbc121deb01?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000119f38 sp=0xc000119f18 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc000119fc8 sp=0xc000119f38 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc000119fe0 sp=0xc000119fc8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000119fe8 sp=0xc000119fe0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 38 gp=0xc000102a80 m=nil [GC worker (idle)]:
runtime.gopark(0xbc1247c9c3?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00011a738 sp=0xc00011a718 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00011a7c8 sp=0xc00011a738 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00011a7e0 sp=0xc00011a7c8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00011a7e8 sp=0xc00011a7e0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 39 gp=0xc000102c40 m=nil [GC worker (idle)]:
runtime.gopark(0xbc121dec6a?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00011af38 sp=0xc00011af18 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00011afc8 sp=0xc00011af38 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00011afe0 sp=0xc00011afc8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00011afe8 sp=0xc00011afe0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 40 gp=0xc000102e00 m=nil [GC worker (idle)]:
runtime.gopark(0x561d43f59120?, 0x1?, 0x9e?, 0x4c?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00011b738 sp=0xc00011b718 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00011b7c8 sp=0xc00011b738 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00011b7e0 sp=0xc00011b7c8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00011b7e8 sp=0xc00011b7e0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 41 gp=0xc000102fc0 m=nil [GC worker (idle)]:
runtime.gopark(0xbc121e079f?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00011bf38 sp=0xc00011bf18 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00011bfc8 sp=0xc00011bf38 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00011bfe0 sp=0xc00011bfc8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00011bfe8 sp=0xc00011bfe0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 42 gp=0xc000103180 m=nil [GC worker (idle)]:
runtime.gopark(0xbc121e39b5?, 0x1?, 0x7c?, 0x49?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000114738 sp=0xc000114718 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc0001147c8 sp=0xc000114738 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc0001147e0 sp=0xc0001147c8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0001147e8 sp=0xc0001147e0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 43 gp=0xc000103340 m=nil [GC worker (idle)]:
runtime.gopark(0x561d43f59120?, 0x1?, 0x5a?, 0x33?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000114f38 sp=0xc000114f18 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc000114fc8 sp=0xc000114f38 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc000114fe0 sp=0xc000114fc8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000114fe8 sp=0xc000114fe0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 5 gp=0xc000003a40 m=nil [GC worker (idle)]:
runtime.gopark(0xbc121e05f4?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000086738 sp=0xc000086718 pc=0x561d422bbd4e
runtime.gcBgMarkWorker(0xc000183730)
        runtime/mgc.go:1423 +0xe9 fp=0xc0000867c8 sp=0xc000086738 pc=0x561d42269249
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc0000867e0 sp=0xc0000867c8 pc=0x561d42269125
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x561d422c3481
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 7 gp=0xc000187dc0 m=nil [sync.WaitGroup.Wait]:
runtime.gopark(0x0?, 0x0?, 0x80?, 0x1?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000087ed0 sp=0xc000087eb0 pc=0x561d422bbd4e
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.semacquire1(0xc000034908, 0x0, 0x1, 0x0, 0x18)
        runtime/sema.go:188 +0x229 fp=0xc000087f38 sp=0xc000087ed0 pc=0x561d4229b489
sync.runtime_SemacquireWaitGroup(0x0?)
        runtime/sema.go:110 +0x25 fp=0xc000087f70 sp=0xc000087f38 pc=0x561d422bd765
sync.(*WaitGroup).Wait(0x0?)
        sync/waitgroup.go:118 +0x48 fp=0xc000087f98 sp=0xc000087f70 pc=0x561d422cedc8
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc000034900, {0x561d4361c880, 0xc0000f4460})
        github.com/ollama/ollama/runner/ollamarunner/runner.go:355 +0x25 fp=0xc000087fb8 sp=0xc000087f98 pc=0x561d427a5a85
github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap2()
        github.com/ollama/ollama/runner/ollamarunner/runner.go:960 +0x28 fp=0xc000087fe0 sp=0xc000087fb8 pc=0x561d427ab208
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000087fe8 sp=0xc000087fe0 pc=0x561d422c3481
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
        github.com/ollama/ollama/runner/ollamarunner/runner.go:960 +0xa74
time=2025-06-29T02:00:15.819-07:00 level=ERROR source=server.go:464 msg="llama runner terminated" error="exit status 2"
time=2025-06-29T02:00:15.842-07:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: error:fault"
[GIN] 2025/06/29 - 02:00:15 | 500 |  702.439409ms |       127.0.0.1 | POST     "/api/chat"
time=2025-06-29T02:00:20.982-07:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.140254431 runner.size="11.2 GiB" runner.vram="8.6 GiB" runner.parallel=1 runner.pid=2786 runner.model=/mnt/HDD/AI/ollama/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de
time=2025-06-29T02:00:21.233-07:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.390768099 runner.size="11.2 GiB" runner.vram="8.6 GiB" runner.parallel=1 runner.pid=2786 runner.model=/mnt/HDD/AI/ollama/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de
time=2025-06-29T02:00:21.482-07:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.640005646 runner.size="11.2 GiB" runner.vram="8.6 GiB" runner.parallel=1 runner.pid=2786 runner.model=/mnt/HDD/AI/ollama/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de

runtime/proc.go:435 +0xce fp=0xc000080f38 sp=0xc000080f18 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc000080fc8 sp=0xc000080f38 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc000080fe0 sp=0xc000080fc8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 34 gp=0xc000102380 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000118738 sp=0xc000118718 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc0001187c8 sp=0xc000118738 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0001187e0 sp=0xc0001187c8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0001187e8 sp=0xc0001187e0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 35 gp=0xc000102540 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000118f38 sp=0xc000118f18 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc000118fc8 sp=0xc000118f38 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc000118fe0 sp=0xc000118fc8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000118fe8 sp=0xc000118fe0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 36 gp=0xc000102700 m=nil [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:435 +0xce fp=0xc000119738 sp=0xc000119718 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc0001197c8 sp=0xc000119738 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0001197e0 sp=0xc0001197c8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0001197e8 sp=0xc0001197e0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 37 gp=0xc0001028c0 m=nil [GC worker (idle)]: runtime.gopark(0xbc121deb01?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000119f38 sp=0xc000119f18 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc000119fc8 sp=0xc000119f38 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc000119fe0 sp=0xc000119fc8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000119fe8 sp=0xc000119fe0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 38 gp=0xc000102a80 m=nil [GC worker (idle)]: runtime.gopark(0xbc1247c9c3?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00011a738 sp=0xc00011a718 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc00011a7c8 sp=0xc00011a738 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00011a7e0 sp=0xc00011a7c8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00011a7e8 sp=0xc00011a7e0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 39 gp=0xc000102c40 m=nil [GC worker (idle)]: runtime.gopark(0xbc121dec6a?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:435 +0xce fp=0xc00011af38 sp=0xc00011af18 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc00011afc8 sp=0xc00011af38 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00011afe0 sp=0xc00011afc8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00011afe8 sp=0xc00011afe0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 40 gp=0xc000102e00 m=nil [GC worker (idle)]: runtime.gopark(0x561d43f59120?, 0x1?, 0x9e?, 0x4c?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00011b738 sp=0xc00011b718 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc00011b7c8 sp=0xc00011b738 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00011b7e0 sp=0xc00011b7c8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00011b7e8 sp=0xc00011b7e0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 41 gp=0xc000102fc0 m=nil [GC worker (idle)]: runtime.gopark(0xbc121e079f?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00011bf38 sp=0xc00011bf18 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc00011bfc8 sp=0xc00011bf38 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00011bfe0 sp=0xc00011bfc8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00011bfe8 sp=0xc00011bfe0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 42 gp=0xc000103180 m=nil [GC worker (idle)]: runtime.gopark(0xbc121e39b5?, 0x1?, 0x7c?, 0x49?, 0x0?) 
runtime/proc.go:435 +0xce fp=0xc000114738 sp=0xc000114718 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc0001147c8 sp=0xc000114738 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0001147e0 sp=0xc0001147c8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0001147e8 sp=0xc0001147e0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 43 gp=0xc000103340 m=nil [GC worker (idle)]: runtime.gopark(0x561d43f59120?, 0x1?, 0x5a?, 0x33?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000114f38 sp=0xc000114f18 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc000114fc8 sp=0xc000114f38 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc000114fe0 sp=0xc000114fc8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000114fe8 sp=0xc000114fe0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 5 gp=0xc000003a40 m=nil [GC worker (idle)]: runtime.gopark(0xbc121e05f4?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000086738 sp=0xc000086718 pc=0x561d422bbd4e runtime.gcBgMarkWorker(0xc000183730) runtime/mgc.go:1423 +0xe9 fp=0xc0000867c8 sp=0xc000086738 pc=0x561d42269249 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0000867e0 sp=0xc0000867c8 pc=0x561d42269125 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x561d422c3481 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 7 gp=0xc000187dc0 m=nil [sync.WaitGroup.Wait]: runtime.gopark(0x0?, 0x0?, 0x80?, 0x1?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000087ed0 sp=0xc000087eb0 pc=0x561d422bbd4e runtime.goparkunlock(...) 
runtime/proc.go:441 runtime.semacquire1(0xc000034908, 0x0, 0x1, 0x0, 0x18) runtime/sema.go:188 +0x229 fp=0xc000087f38 sp=0xc000087ed0 pc=0x561d4229b489 sync.runtime_SemacquireWaitGroup(0x0?) runtime/sema.go:110 +0x25 fp=0xc000087f70 sp=0xc000087f38 pc=0x561d422bd765 sync.(*WaitGroup).Wait(0x0?) sync/waitgroup.go:118 +0x48 fp=0xc000087f98 sp=0xc000087f70 pc=0x561d422cedc8 github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc000034900, {0x561d4361c880, 0xc0000f4460}) github.com/ollama/ollama/runner/ollamarunner/runner.go:355 +0x25 fp=0xc000087fb8 sp=0xc000087f98 pc=0x561d427a5a85 github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap2() github.com/ollama/ollama/runner/ollamarunner/runner.go:960 +0x28 fp=0xc000087fe0 sp=0xc000087fb8 pc=0x561d427ab208 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000087fe8 sp=0xc000087fe0 pc=0x561d422c3481 created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1 github.com/ollama/ollama/runner/ollamarunner/runner.go:960 +0xa74
time=2025-06-29T02:00:15.819-07:00 level=ERROR source=server.go:464 msg="llama runner terminated" error="exit status 2"
time=2025-06-29T02:00:15.842-07:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: error:fault"
[GIN] 2025/06/29 - 02:00:15 | 500 | 702.439409ms | 127.0.0.1 | POST "/api/chat"
time=2025-06-29T02:00:20.982-07:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.140254431 runner.size="11.2 GiB" runner.vram="8.6 GiB" runner.parallel=1 runner.pid=2786 runner.model=/mnt/HDD/AI/ollama/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de
time=2025-06-29T02:00:21.233-07:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.390768099 runner.size="11.2 GiB" runner.vram="8.6 GiB" runner.parallel=1 runner.pid=2786 runner.model=/mnt/HDD/AI/ollama/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de
time=2025-06-29T02:00:21.482-07:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.640005646 runner.size="11.2 GiB" runner.vram="8.6 GiB" runner.parallel=1 runner.pid=2786 runner.model=/mnt/HDD/AI/ollama/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de
```
Author
Owner

@ThomasRocha82 commented on GitHub (Jun 29, 2025):

I'm getting this:

"gpu VRAM usage didn't recover within timeout" - but only with Qwen 3, along with some of the other errors people are reporting.

I guess it's broken for that LLM. Other models work fine.

4080 Nvidia GPU, 13th Gen i9 and 64GB RAM.

Reinstall didn't work for me. Downgrading to the previous version (before this last update) did though.

<!-- gh-comment-id:3016618978 -->
Author
Owner

@rick-github commented on GitHub (Jun 29, 2025):

> gpu VRAM usage didn't recover within timeout.

This is not an error, it's a status message saying that ollama is waiting for the GPU driver to release VRAM.

It would be easier to debug whatever issue you are having if you post a full log.

<!-- gh-comment-id:3016640901 -->
Author
Owner

@ThomasRocha82 commented on GitHub (Jun 29, 2025):

It was a similar error to what @zerowxt posted in their error log - same runner error code and all.

I uninstalled it via add/remove programs (0.9.3) and re-installed the previous version via the installer (0.9.2). Prior to that, I had uninstalled it in the same way (0.9.3) and re-installed the current version (0.9.3) - no dice.

I saw https://github.com/ollama/ollama/issues/11211 but I don't do manual installs. Upgrades are through the tray icon and installs via installer and uninstalls through the uninstaller. Windows 11 Pro.

I'll get a log once I have some down time, as I need the AI working on some stuff, but I have a feeling it's likely the same scenario (Qwen3, similar log entries, only happened after the update -- "llama runner terminated" error="exit status 2", "error loading llama server" error="llama runner process has terminated: error:fault", etc.).

But for now, I'll stick with what works (0.9.2) so I can continue my work.

<!-- gh-comment-id:3016664199 -->
Author
Owner

@sol8712 commented on GitHub (Jun 29, 2025):

> ```
> load_backend: loaded CUDA backend from D:\programmer\ai\ollama\lib\ollama\ggml-cuda.dll
> load_backend: loaded CUDA backend from D:\programmer\ai\ollama\lib\ollama\cuda_v12\ggml-cuda.dll
> ```
>
> Your runner is loading the CUDA backend twice from different locations, so this is probably [#11211](https://github.com/ollama/ollama/issues/11211). See [here](https://github.com/ollama/ollama/issues/11211#issuecomment-3014572069) for details.

Initially this seemed to fix the issue - I know it was working - however I've started getting new errors running gemma3 after applying the solution (deleted files, reinstalled, confirmed only one CUDA backend loaded, not two):

`400: "gemma3:12b" does not support chat`

<!-- gh-comment-id:3016813915 -->
Author
Owner

@rick-github commented on GitHub (Jun 29, 2025):

> 400: "gemma3:12b" does not support chat

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

<!-- gh-comment-id:3016816660 -->
Author
Owner

@sol8712 commented on GitHub (Jun 29, 2025):

> > 400: "gemma3:12b" does not support chat
>
> Server logs will aid in debugging.

I got to the logs and... my bad, this was user error. The original multiple-CUDA-backend manual-install issue is fixed for me; hopefully this also fixes it for @zerowxt as well, and the topic can be solved with https://github.com/ollama/ollama/issues/11229#issuecomment-3015242276

<!-- gh-comment-id:3017037352 -->
Author
Owner

@CrazyBunQnQ commented on GitHub (Jun 30, 2025):

> I also cannot run Gemma3:12b right after latest update on Thursday on Windows Other models are fine. Version after the update is: ollama version is 0.9.3

I have the same error.

<!-- gh-comment-id:3017730645 -->
Author
Owner

@hkbb2014 commented on GitHub (Jun 30, 2025):

me too.

> ollama run gemma3n:e4b
> ⠇ Error: llama runner process has terminated: error:fault

My [server.log](https://github.com/user-attachments/files/20977518/server.log)

<!-- gh-comment-id:3018576431 -->
Author
Owner

@Amigos-Flipado commented on GitHub (Jun 30, 2025):

I completely uninstalled Ollama (except for the models, of course), and after a clean reinstall everything returned to normal and all the models now work.

<!-- gh-comment-id:3018589791 -->
Author
Owner

@rick-github commented on GitHub (Jun 30, 2025):

@hkbb2014

```
load_backend: loaded CUDA backend from Y:\ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CUDA backend from Y:\ollama\lib\ollama\cuda_v12\ggml-cuda.dll
```

https://github.com/ollama/ollama/issues/11229#issuecomment-3015242276

<!-- gh-comment-id:3018655886 -->
Author
Owner

@vgrechin commented on GitHub (Jun 30, 2025):

I removed Ollama via Control Panel and reinstalled from the latest package of Ollama v0.9.3. It basically solved the issue, now I can run gemma3:12b again, Thanks!

<!-- gh-comment-id:3019411738 -->
Author
Owner

@jeepshop commented on GitHub (Jun 30, 2025):

I'm curious as well. I'm running 0.9.3, and while I can run gemma3, it's extremely CPU-intensive, so not practical.

I've tried both of the following with similar results on my V100: 60% VRAM usage, 50% GPU, 1400% CPU (I assume that means 14 of my 32 cores?), yielding 11 tok/s.

- hf.co/unsloth/gemma-3-27b-it-qat-GGUF:Q4_K_XL
- gemma3:27b-it-q4_K_M

Also having similar performance issues with

- mistral-small3.2:24b-instruct-2506-q4_K_M
<!-- gh-comment-id:3020826144 -->
Author
Owner

@rick-github commented on GitHub (Jul 1, 2025):

High CPU usage would imply that some of the layers are being offloaded to system RAM where the slower CPU does inference. Server logs will show layer assignment.

| model | tps all layers in GPU | tps 50% layers in GPU |
| -- | -- | -- |
| hf.co/unsloth/gemma-3-27b-it-qat-GGUF:Q4_K_XL | 40.61 | 9.72 |
| gemma3:27b-it-q4_K_M | 41.70 | 9.90 |
| mistral-small3.2:24b-instruct-2506-q4_K_M | 51.02 | 12.79 |
| deepseek-r1:32b | 39.31 | 8.87 |

Note that vision model quants from HF usually don't have the format that ollama expects, so hf.co/unsloth/gemma-3-27b-it-qat-GGUF:Q4_K_XL will not be able to process images.
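For scale, the gap between the two columns in the table works out to roughly a 4x slowdown. A quick sketch of the ratios - this is just arithmetic on the numbers above, not a new measurement:

```shell
# Ratio of all-layers-in-GPU tps to 50%-offload tps, one pair per table row.
printf '%s %s\n' \
  40.61 9.72 \
  41.70 9.90 \
  51.02 12.79 \
  39.31 8.87 |
awk '{ printf "%.1fx\n", $1 / $2 }'
# -> 4.2x, 4.2x, 4.0x, 4.4x
```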

<!-- gh-comment-id:3023298243 -->
Author
Owner

@jeepshop commented on GitHub (Jul 1, 2025):

I should have mentioned earlier: I have no performance issues running other similar-sized and larger models. I'm only having this issue with gemma3 and the mistral3 models. Deepseek, Devstral, and Phi4 all run at expected generation speeds.

Here is the script I'm running (thanks to you actually :))

```bash
# MODEL and CONTEXT must be set in the shell for the jq labels below.
curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"gemma3:27b-it-q4_K_M\",
  \"prompt\": \"Tell me a detailed story about the rise and fall of an ancient civilization, including cultural, technological, and political aspects.\",
  \"options\": {\"num_ctx\": 16384, \"keep_alive\": \"1m\"},
  \"stream\": false
}" | jq '{
  model: "'$MODEL'",
  context_size: '$CONTEXT',
  tokens: .eval_count,
  duration_seconds: (.eval_duration / 1e9),
  tokens_per_second: (.eval_count / (.eval_duration / 1e9))
}'
```

I see these in the log, which seems to imply that everything is in VRAM and not any in system RAM.

```
Jul 01 08:42:34 Darkstar ollama[1086]: time=2025-07-01T08:42:34.396-04:00 level=INFO source=sched.go:788 msg="new model **will fit in available VRAM** in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-e796792eba26c4d3b04b0ac5adb01a453dd9ec2dfd83b6c59cbf6fe5f30b0f68 gpu=GPU-2229ba74-dbe4-5c32-e086-3ca1f0ab65b4 parallel=1 available=33754841088 required="19.9 GiB"
Jul 01 08:42:34 Darkstar ollama[1086]: time=2025-07-01T08:42:34.823-04:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=63 **layers.offload=63** layers.split="" **memory.available="[31.4 GiB]"** memory.gpu_overhead="0 B" **memory.required.full="19.9 GiB"** memory.required.partial="19.9 GiB" memory.required.kv="952.0 MiB" memory.required.allocations="[19.9 GiB]" memory.weights.total="15.4 GiB" memory.weights.repeating="14.3 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
```

Here is the log during that run of gemma3.
[Gemma3.log](https://github.com/user-attachments/files/20998781/Gemma3.log)

And here is a grab of nvtop during the run - showing 1400% CPU.
![nvtop during the run](https://github.com/user-attachments/assets/27bd3b19-c34d-438d-8089-f8be87f0e4e9)
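(For reference, the `tokens_per_second` value in the jq filter above is just `eval_count` divided by `eval_duration` in seconds. The same arithmetic stand-alone, with illustrative values rather than a real API response:)

```shell
# tok/s from the /api/generate response fields; the two values below are
# made up for illustration, not taken from an actual run.
eval_count=300                 # tokens generated
eval_duration=27000000000      # nanoseconds
awk -v c="$eval_count" -v d="$eval_duration" \
    'BEGIN { printf "%.2f tok/s\n", c / (d / 1e9) }'
# -> 11.11 tok/s
```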

<!-- gh-comment-id:3023934357 -->
Author
Owner

@rick-github commented on GitHub (Jul 1, 2025):

What are the expected and actual generation speeds? I ran your script on an RTX6000 with 48G and got these results, with CPU usage ~100%:

| model | tps |
| -- | -- |
| hf.co/unsloth/gemma-3-27b-it-qat-GGUF:Q4_K_XL | 38.02 |
| gemma3:27b-it-q4_K_M | 38.27 |
| mistral-small3.2:24b-instruct-2506-q4_K_M | 47.09 |
| deepseek-r1:32b-qwen-distill-q4_K_M | 33.78 |
| devstral:24b-small-2505-q4_K_M | 46.37 |
| phi4:14b-q4_K_M | 72.26 |

> I see these in the log, which seems to imply that everything is in VRAM and not any in system RAM.

These are aspirational. The ollama server has calculated that everything can fit in VRAM, but it's up to the runner to do the actual layer loading. Some layers will not be loaded into VRAM because the shape of a tensor won't fit or a necessary operation is not supported by the hardware. The log snippet shows:

```
Jul 01 08:42:35 Darkstar ollama[1086]: time=2025-07-01T08:42:35.205-04:00 level=INFO source=ggml.go:359 msg="model weights" buffer=CUDA0 size="16.2 GiB"
Jul 01 08:42:35 Darkstar ollama[1086]: time=2025-07-01T08:42:35.205-04:00 level=INFO source=ggml.go:359 msg="model weights" buffer=CPU size="1.1 GiB"
```

so some of the model weights are ending up in system RAM. This is usually the output layer, and its impact on token generation should be minimal, so it might not be the cause of the problem you are seeing. Logging with `OLLAMA_DEBUG=2` will show the actual layer assignment.
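One way to eyeball the split once the log is in hand is to grep for the per-buffer "model weights" lines. A sketch - the two sample lines are abridged versions of the log lines above, standing in for a real server log:

```shell
# Extract buffer/size pairs from "model weights" log lines. The sample
# lines below are illustrative; pipe your actual server log into grep instead.
printf '%s\n' \
  'msg="model weights" buffer=CUDA0 size="16.2 GiB"' \
  'msg="model weights" buffer=CPU size="1.1 GiB"' |
grep -o 'buffer=[^ ]* size="[^"]*"'
# -> buffer=CUDA0 size="16.2 GiB"
#    buffer=CPU size="1.1 GiB"
```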

<!-- gh-comment-id:3024192072 -->
Author
Owner

@jeepshop commented on GitHub (Jul 1, 2025):

Hard to know what I **expect** - I was thinking it would be in the same ballpark as devstral:24b-small-2505-q4_K_M and deepseek-r1:32b-q4_K_M, which give me 39 and 26 tok/s respectively with very little CPU use.

What I get with gemma3:27b-it-q4_K_M is 11-14 tok/s.

Attached is a fresh run with `OLLAMA_DEBUG=2`.

[Gemma3.log](https://github.com/user-attachments/files/21003342/Gemma3.log)

<!-- gh-comment-id:3024484339 -->
Author
Owner

@rick-github commented on GitHub (Jul 1, 2025):

Try changing the flash attention settings. https://github.com/ollama/ollama/issues/9683
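For a systemd-managed install, the flash attention setting referred to here is typically the `OLLAMA_FLASH_ATTENTION` environment variable (set to `0` below to disable it, per #9683). A sketch of a drop-in override - the path and exact behavior depend on your install and ollama version, so verify against the docs:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_FLASH_ATTENTION=0"
```

Then reload and restart the service (`systemctl daemon-reload && systemctl restart ollama`). On Windows the same variable can be set as a user environment variable before starting Ollama.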

<!-- gh-comment-id:3024743112 -->
Author
Owner

@jeepshop commented on GitHub (Jul 2, 2025):

I have verified that disabling flash attention does in fact solve the problem for me. Now I'm getting ~27 tok/s with almost 0 CPU load. So #9683 was definitely my issue.

<!-- gh-comment-id:3027937956 -->
Reference: github-starred/ollama#53909