[GH-ISSUE #10855] Getting "500: Ollama: 500, message='Internal Server Error'" with some models such as Gemma3:12b-it-qat #7129

Closed
opened 2026-04-12 19:08:19 -05:00 by GiteaMirror · 2 comments

Originally created by @mekler22 on GitHub (May 25, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10855

What is the issue?

In the past I ran this model with no issue on my (Linux) Docker installation with an RTX 3060 12GB card. Now it fails to load, claiming it has run out of VRAM. This is the 4-bit quantized version, which should fit within 12 GB of VRAM without issue; it did work at first and then stopped working.
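For rough context on why a 12 GB card can still overflow here, a minimal back-of-envelope sketch (assuming roughly 12B parameters and ~4.5 bits per weight for the Q4 quantization; neither figure is taken from the report) adds the quantized weights to the ~4.3 GiB compute-graph buffer reported in the log below:

```python
# Back-of-envelope VRAM estimate; a rough sketch, not Ollama's actual accounting.
params = 12e9                 # approximate parameter count for a 12B model (assumption)
bits_per_weight = 4.5         # ~Q4 quantization including per-block scales (assumption)

weights_gib = params * bits_per_weight / 8 / 2**30
graph_gib = 4_599_056_512 / 2**30   # CUDA0 compute-graph buffer size from the log below

print(f"weights ~{weights_gib:.1f} GiB + compute graph ~{graph_gib:.1f} GiB "
      f"= ~{weights_gib + graph_gib:.1f} GiB before any KV cache")
```

With the multi-GiB KV cache the runner also tries to reserve, the total can exceed 12 GB even for a "4-bit" model, which matches the cudaMalloc failure below.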

Relevant log output

time=2025-05-25T08:43:54.316Z level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-05-25T08:43:54.316Z level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4386.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 4599056512
time=2025-05-25T08:43:55.565Z level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="4.3 GiB"
time=2025-05-25T08:43:55.565Z level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="4.0 GiB"
panic: insufficient memory - required allocations: {InputWeights:2013265920A CPU:{Name:CPU Weights:[126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 126138368A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 2867456448A] Cache:[12582912A 12582912A 12582912A 12582912A 12582912A 1073741824A 12582912A 12582912A 12582912A 12582912A 12582912A 1073741824A 12582912A 12582912A 12582912A 12582912A 12582912A 1073741824A 12582912A 12582912A 12582912A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:4335337472A} GPUs:[{Name:CUDA0 Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 126139648A 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 12582912A 12582912A 1073741824A 12582912A 12582912A 12582912A 12582912A 12582912A 1073741824A 12582912A 12582912A 12582912A 12582912A 12582912A 1073741824A 12582912A 12582912A 12582912A 12582912A 12582912A 1073741824A 12582912A 12582912A 12582912A 12582912A 12582912A 1073741824A 0U] Graph:4599056512F}]}
goroutine 6 [running]:
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0xc000f8d2c0)
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:643 +0x756
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc0000359e0)
	github.com/ollama/ollama/runner/ollamarunner/runner.go:826 +0xbcd
github.com/ollama/ollama/runner/ollamarunner.(*Server).initModel(0xc0000359e0, {0x7ffec004dcac?, 0x0?}, {0x6, 0x0, 0x1b, {0x0, 0x0, 0x0}, 0x0}, ...)
	github.com/ollama/ollama/runner/ollamarunner/runner.go:865 +0x270
github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc0000359e0, {0x56495753ca90, 0xc00061ce60}, {0x7ffec004dcac?, 0x0?}, {0x6, 0x0, 0x1b, {0x0, 0x0, ...}, ...}, ...)
	github.com/ollama/ollama/runner/ollamarunner/runner.go:878 +0xb8
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	github.com/ollama/ollama/runner/ollamarunner/runner.go:959 +0xa11
time=2025-05-25T08:43:55.737Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-05-25T08:43:55.774Z level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
time=2025-05-25T08:43:55.987Z level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory\nggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 4599056512"
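The required-allocations structure in the panic line above can be totaled per backend with a short helper like the one below (a sketch only: the filename is hypothetical, the one-letter status suffixes on each number, e.g. `126138368A` / `0U` / `4599056512F`, are taken as they appear in this log, and the top-level InputWeights entry is not attributed to any backend):

```python
import re
import sys

def tally_allocations(panic_line: str) -> None:
    """Sum the Weights/Cache/Graph byte counts for each backend named
    in an 'insufficient memory - required allocations' panic line."""
    # Each backend section starts with "{Name:<backend>"; split on that marker.
    for chunk in re.split(r"\{Name:", panic_line)[1:]:
        name = chunk.split()[0]
        # Numbers in this log carry a trailing status letter (A/U/F); strip it and sum.
        total = sum(int(n) for n in re.findall(r"(\d+)[A-Z]", chunk))
        print(f"{name}: {total / 2**30:.2f} GiB required")

# Usage: python tally.py runner.log   (runner.log: a hypothetical file holding the log above)
with open(sys.argv[1]) as f:
    for line in f:
        if line.startswith("panic: insufficient memory"):
            tally_allocations(line.strip())
```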

OS

Linux, Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.7.1

GiteaMirror added the memory and bug labels 2026-04-12 19:08:19 -05:00

@akaspeh1 commented on GitHub (May 29, 2025):

I have a similar issue with Gemma 27B.
I can't load it with a context size of 20000, but it loads fine with context sizes of 40000, 6000, and 10000.
As long as the memory shown in ollama ps stays within the 4090's VRAM, it's fine. Once it goes over, I have to go way over it (e.g. a ctx_size of 40000) for it to load without crashing.
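A quick way to reproduce this sweep (a sketch assuming a local Ollama server on the default port and the `gemma3:27b` tag, neither of which is confirmed in the comment) is to request a short generation at each context size and see which ones load:

```python
import requests  # assumes a local Ollama server reachable at the default address

# Try to load the model at several context sizes and report which ones come up cleanly.
for num_ctx in (6000, 10000, 20000, 40000):
    try:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "gemma3:27b",          # model tag assumed from the comment
                "prompt": "hi",
                "stream": False,
                "options": {"num_ctx": num_ctx},
            },
            timeout=600,
        )
        status = "loaded" if r.ok else f"failed ({r.status_code}: {r.text[:80]})"
    except requests.RequestException as exc:
        status = f"request error ({exc})"
    print(f"num_ctx={num_ctx}: {status}")
```

A failing context size should show the same insufficient-memory panic in the server log as the one below.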
Specs:
RTX 4090 24GB
GTX 1070 8GB

Ollama 0.7.1 Windows

Relevant log

time=2025-05-29T12:16:10.280+02:00 level=INFO source=ggml.go:92 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
load_backend: loaded CPU backend from C:\Users\jstarman\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
load_backend: loaded CUDA backend from C:\Users\jstarman\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-05-29T12:16:10.426+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-05-29T12:16:10.441+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-29T12:16:10.538+02:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-05-29T12:16:10.538+02:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="11.0 GiB"
time=2025-05-29T12:16:10.538+02:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="5.2 GiB"
time=2025-05-29T12:16:10.669+02:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-05-29T12:16:10.669+02:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="1.1 GiB"
time=2025-05-29T12:16:10.669+02:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1874.01 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1965040128
time=2025-05-29T12:16:11.059+02:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.8 GiB"
time=2025-05-29T12:16:11.059+02:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="1.8 GiB"
time=2025-05-29T12:16:11.059+02:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="10.5 MiB"
panic: insufficient memory - required allocations: {InputWeights:1156055040A CPU:{Name:CPU Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:11010048A} GPUs:[{Name:CUDA0 Weights:[264974592A 264974592A 264974592A 264974592A 264974592A 264974592A 264974592A 235170048A 235170048A 264974592A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[12582912A 12582912A 12582912A 12582912A 12582912A 204996608A 12582912A 12582912A 12582912A 12582912A 12582912A 204996608A 12582912A 12582912A 12582912A 12582912A 12582912A 204996608A 12582912A 12582912A 12582912A 12582912A 12582912A 204996608A 12582912A 12582912A 12582912A 12582912A 12582912A 204996608A 12582912A 12582912A 12582912A 12582912A 12582912A 204996608A 12582912A 12582912A 12582912A 12582912A 12582912A 204996608A 12582912A 12582912A 12582912A 12582912A 12582912A 204996608A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:1954029568A} {Name:CUDA1 Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 262136064A 235170048A 232331520A 262136064A 235170048A 232331520A 262136064A 264974592A 262136064A 262136064A 264974592A 262136064A 262136064A 264974592A 2003000704A] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 12582912A 12582912A 12582912A 12582912A 12582912A 204996608A 12582912A 12582912A 12582912A 12582912A 12582912A 204996608A 12582912A 12582912A 0U] Graph:1965040128F}]}

goroutine 28 [running]:
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0xc001e20980)
	C:/a/ollama/ollama/ml/backend/ggml/ggml.go:643 +0x756
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc0004b0000)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:826 +0xbcd
github.com/ollama/ollama/runner/ollamarunner.(*Server).initModel(0xc0004b0000, {0xc000042150?, 0x0?}, {0x8, 0x0, 0x3f, {0xc0003261c8, 0x2, 0x2}, 0x0}, ...)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:865 +0x270
github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc0004b0000, {0x7ff731f6a7c0, 0xc0004b2000}, {0xc000042150?, 0x0?}, {0x8, 0x0, 0x3f, {0xc0003261c8, 0x2, ...}, ...}, ...)
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:878 +0xb8
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	C:/a/ollama/ollama/runner/ollamarunner/runner.go:959 +0xa11
time=2025-05-29T12:16:11.191+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"

@jessegross commented on GitHub (Sep 24, 2025):

I'm going to go ahead and close this now that the new memory management logic is on by default. If you continue to see problems, please file a new issue.

Reference: github-starred/ollama#7129