[GH-ISSUE #12282] 0.7.0 -> 0.7.1+ breaks memory estimation completely #8164

Open
opened 2026-04-12 20:34:32 -05:00 by GiteaMirror · 6 comments

Originally created by @thot-experiment on GitHub (Sep 14, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12282

What is the issue?

0.7.0 works fine; upgrading to 0.7.1 or any subsequent version (tried 0.8, 0.9, 0.10, 0.11, 0.11.10 and 0.11.11-rc0) breaks memory estimation and the model fails to load.

Loading Mistral on 0.7.0:

time=2025-09-14T00:16:43.796-07:00 level=INFO source=server.go:135 msg="system memory" total="63.6 GiB" free="51.0 GiB" free_swap="125.5 GiB"
time=2025-09-14T00:16:43.827-07:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=999 layers.model=41 layers.offload=24 layers.split=21,3 memory.available="[30.7 GiB 9.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="50.2 GiB" memory.required.partial="40.0 GiB" memory.required.kv="10.0 GiB" memory.required.allocations="[30.2 GiB 9.7 GiB]" memory.weights.total="14.9 GiB" memory.weights.repeating="14.5 GiB" memory.weights.nonrepeating="440.0 MiB" memory.graph.full="6.7 GiB" memory.graph.partial="6.7 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-09-14T00:16:43.827-07:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-09-14T00:16:43.853-07:00 level=INFO source=server.go:431 msg="starting llama server" cmd="G:\\ollama\\ollama.exe runner --ollama-engine --model G:\\ollama\\models\\blobs\\sha256-ba565a094c6568442fe08807e976641e68f4a8b97db00058acd53886a9718148 --ctx-size 131072 --batch-size 512 --n-gpu-layers 999 --threads 12 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 21,3 --port 52084"
time=2025-09-14T00:16:43.854-07:00 level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-09-14T00:16:43.854-07:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-09-14T00:16:43.854-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-09-14T00:16:43.877-07:00 level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-09-14T00:16:43.878-07:00 level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:52084"
time=2025-09-14T00:16:43.894-07:00 level=INFO source=ggml.go:73 msg="" architecture=mistral3 file_type=Q5_K_S name="" description="" num_tensors=585 num_key_values=43
load_backend: loaded CPU backend from G:\ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-09-14T00:16:44.105-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Quadro GV100, compute capability 7.0, VMM: yes
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
load_backend: loaded CUDA backend from G:\ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-09-14T00:16:44.155-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-09-14T00:16:44.264-07:00 level=INFO source=ggml.go:299 msg="model weights" buffer=CUDA1 size="2.7 GiB"
time=2025-09-14T00:16:44.264-07:00 level=INFO source=ggml.go:299 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-09-14T00:16:44.264-07:00 level=INFO source=ggml.go:299 msg="model weights" buffer=CUDA0 size="13.0 GiB"
time=2025-09-14T00:16:49.740-07:00 level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="716.0 MiB"
time=2025-09-14T00:16:49.740-07:00 level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="756.0 MiB"
time=2025-09-14T00:16:49.740-07:00 level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="10.0 MiB"

vs. trying to load Mistral on 0.7.1:

time=2025-09-14T00:22:09.185-07:00 level=INFO source=server.go:135 msg="system memory" total="63.6 GiB" free="50.6 GiB" free_swap="125.3 GiB"
time=2025-09-14T00:22:09.218-07:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=999 layers.model=41 layers.offload=40 layers.split=37,3 memory.available="[30.7 GiB 9.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="49.9 GiB" memory.required.partial="39.9 GiB" memory.required.kv="10.0 GiB" memory.required.allocations="[30.4 GiB 9.6 GiB]" memory.weights.total="14.9 GiB" memory.weights.repeating="14.5 GiB" memory.weights.nonrepeating="440.0 MiB" memory.graph.full="6.7 GiB" memory.graph.partial="6.7 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-09-14T00:22:09.218-07:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-09-14T00:22:09.244-07:00 level=INFO source=server.go:431 msg="starting llama server" cmd="G:\\ollama\\ollama.exe runner --ollama-engine --model G:\\ollama\\models\\blobs\\sha256-ba565a094c6568442fe08807e976641e68f4a8b97db00058acd53886a9718148 --ctx-size 131072 --batch-size 512 --n-gpu-layers 999 --threads 12 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 37,3 --port 55492"
time=2025-09-14T00:22:09.244-07:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-09-14T00:22:09.245-07:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-09-14T00:22:09.245-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-09-14T00:22:09.266-07:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-09-14T00:22:09.267-07:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:55492"
time=2025-09-14T00:22:09.285-07:00 level=INFO source=ggml.go:92 msg="" architecture=mistral3 file_type=Q5_K_S name="" description="" num_tensors=585 num_key_values=43
load_backend: loaded CPU backend from G:\ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-09-14T00:22:09.495-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Quadro GV100, compute capability 7.0, VMM: yes
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
load_backend: loaded CUDA backend from G:\ollama\lib\ollama\cuda_v12\ggml-cuda.dll
time=2025-09-14T00:22:10.029-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-09-14T00:22:10.136-07:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-09-14T00:22:10.136-07:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="13.8 GiB"
time=2025-09-14T00:22:10.136-07:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA1 size="2.0 GiB"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 9337.48 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 9791055360
time=2025-09-14T00:22:11.217-07:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-09-14T00:22:11.217-07:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="9.1 GiB"
time=2025-09-14T00:22:11.217-07:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
panic: insufficient memory - required allocations: {InputWeights:550502400A CPU:{Name:CPU Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} GPUs:[{Name:CUDA0 Weights:[388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 388997120A 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A} {Name:CUDA1 Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 388997120A 388997120A 1339412480A] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:9791055360F}]}

goroutine 15 [running]:
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0xc000612140)
        C:/a/ollama/ollama/ml/backend/ggml/ggml.go:643 +0x756
github.com/ollama/ollama/runner/ollamarunner.multimodalStore.getTensor(0xc00049b9f8?, {0x7ff7eb5be7f0, 0xc000df2120}, {0x7ff7eb5c2b68, 0xc0011dfe80}, {0x7ff7eb5cec48, 0xc000667638}, 0x1)
        C:/a/ollama/ollama/runner/ollamarunner/multimodal.go:98 +0x2a4
github.com/ollama/ollama/runner/ollamarunner.multimodalStore.getMultimodal(0xc001073cd8, {0x7ff7eb5be7f0, 0xc000df2120}, {0x7ff7eb5c2b68, 0xc0011dfe80}, {0xc0007140c0, 0x1, 0x7ff7eb23e100?}, 0x1)
        C:/a/ollama/ollama/runner/ollamarunner/multimodal.go:56 +0xe5
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc00063dd40)
        C:/a/ollama/ollama/runner/ollamarunner/runner.go:796 +0x70e
github.com/ollama/ollama/runner/ollamarunner.(*Server).initModel(0xc00063dd40, {0xc00003a2a0?, 0x0?}, {0xc, 0x0, 0x3e7, {0xc00045d2b0, 0x2, 0x2}, 0x1}, ...)
        C:/a/ollama/ollama/runner/ollamarunner/runner.go:865 +0x270
github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc00063dd40, {0x7ff7eb5ba7c0, 0xc0004b9590}, {0xc00003a2a0?, 0x0?}, {0xc, 0x0, 0x3e7, {0xc00045d2b0, 0x2, ...}, ...}, ...)
        C:/a/ollama/ollama/runner/ollamarunner/runner.go:878 +0xb8
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
        C:/a/ollama/ollama/runner/ollamarunner/runner.go:959 +0xa11
time=2025-09-14T00:22:11.314-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-09-14T00:22:11.332-07:00 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
time=2025-09-14T00:22:11.565-07:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory"
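
The overcommit is visible directly in the 0.7.1 numbers above: CUDA1 (the 1080 Ti) reports 9.7 GiB available, but the chosen split places 2.0 GiB of weights on it and then tries to reserve a 9.1 GiB compute graph on the same device. A minimal back-of-the-envelope sketch of that arithmetic, with the sizes copied from the log (illustrative only, not how Ollama computes the split):

```go
// Sums the per-device requirements reported in the 0.7.1 log and checks
// them against the advertised free VRAM. Sizes are in GiB.
package main

import "fmt"

func main() {
	available := map[string]float64{"CUDA0": 30.7, "CUDA1": 9.7}
	weights := map[string]float64{"CUDA0": 13.8, "CUDA1": 2.0}
	graph := map[string]float64{"CUDA0": 0.0, "CUDA1": 9.1} // the reserved graph landed on CUDA1

	for _, dev := range []string{"CUDA0", "CUDA1"} {
		need := weights[dev] + graph[dev]
		fmt.Printf("%s: need %.1f GiB of %.1f GiB free -> fits: %v\n",
			dev, need, available[dev], need <= available[dev])
	}
}
```

Under these numbers CUDA1 needs roughly 11.1 GiB against 9.7 GiB free, which is consistent with the failed 9337.48 MiB cudaMalloc above, whereas the 0.7.0 estimate (split 21,3, 24 layers offloaded) never placed that much on the smaller card.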
GiteaMirror added the bug label 2026-04-12 20:34:32 -05:00

@jessegross commented on GitHub (Sep 15, 2025):

layers.requested=999

You are forcing the entire model onto the GPU and getting OOM as a result. 0.7.1 requires the full amount of memory to be available in order to avoid crashes during inference time. 0.11.11 generally handles models like mistral3 better but the result will be the same if there isn't sufficient memory.
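
A rough sketch of what that up-front requirement means in practice (this is not Ollama's actual estimator; the type, function, and numbers below are illustrative assumptions): weights, KV cache, and the worst-case compute graph all have to fit in a device's free VRAM at load time, rather than relying on small inputs keeping real usage lower.

```go
// Hypothetical "full reservation" fit check, per GPU. Ollama's real logic is
// more involved (layer splitting, overheads); this only shows the principle
// that the worst-case graph is counted up front.
package main

import "fmt"

const GiB = 1 << 30

type gpuPlan struct {
	name                              string
	freeVRAM, weights, kvCache, graph uint64 // bytes
}

func fitsUpFront(p gpuPlan) bool {
	return p.weights+p.kvCache+p.graph <= p.freeVRAM
}

func main() {
	plans := []gpuPlan{
		// Numbers loosely follow the logs in this issue: the large projector
		// graph has to be reserved on a single device.
		{"CUDA0", 30 * GiB, 14 * GiB, 8 * GiB, 1 * GiB},
		{"CUDA1", 9 * GiB, 2 * GiB, 2 * GiB, 9 * GiB},
	}
	for _, p := range plans {
		fmt.Printf("%s fits up front: %v\n", p.name, fitsUpFront(p))
	}
}
```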

@thot-experiment commented on GitHub (Sep 15, 2025):

Can you explain further? I'm intentionally forcing the model onto the GPU because I have 43 GB of VRAM available and the model weights in this particular case are 17 GB; even with context I should have plenty of space. This is Mistral 3.1 Small, a 24B model. Even with maxed-out contexts I don't experience crashes in 0.7.0.

I have run 70B-parameter models (albeit with smaller contexts) just fine in the past on this machine. I'm not saying you're wrong, but it seems I'm missing some piece of the puzzle here.

@jessegross commented on GitHub (Sep 15, 2025):

mistral-small3.1 has a vision projector with a large compute graph (roughly 9 GiB). On earlier versions, low-resolution images would use less memory and would sometimes fit on smaller GPUs, but it would crash if you sent a larger image. Current versions of Ollama require the full amount of memory to be available.

I would try it with 0.11.11 without forcing a specific number of layers to offload. If it doesn't load the whole thing and you think it should, then please post the whole log from that version here.
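
If it helps with that test, a minimal reproduction sketch against the standard local API, letting the scheduler pick the offload split instead of forcing 999 layers (the model name and 128k context are taken from this thread; everything else is plumbing):

```go
// Ask the local Ollama server to load and run the model without overriding
// num_gpu, so the scheduler decides how many layers to offload.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model":  "mistral-small3.1",
		"prompt": "Say hello.",
		"stream": false,
		"options": map[string]any{
			"num_ctx": 131072, // the 128k context from this report; note: no num_gpu override
		},
	})
	resp, err := http.Post("http://127.0.0.1:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```

The server log from such a run would show the chosen layers.offload and tensor split for comparison with the logs above.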

@thot-experiment commented on GitHub (Sep 15, 2025):

Upon testing more cases: 0.7.1 does load the model if I knock the context down to 32k, and inference works as expected. At 64k I get the same OOM error.

0.7.0, however, works just fine at 128k. Loading up Gulliver's Travels (minus the last chapter, just under 128k tokens) with a needle hidden in the first sentence:

"My father had a small estate in Cambridgehamshire upon Avon;"

vs the original

"My father had a small estate in Nottinghamshire;"

prompting the model with

"Where was the author's father's estate located?"

0.7.0 correctly answers

"The author's father's estate was located in Cambridgehamshire upon Avon."

@thot-experiment commented on GitHub (Sep 15, 2025):

Looks like the issue is related to multi-GPU model splitting: 0.7.1 manages to load the model correctly at 128k if I use the CUDA_VISIBLE_DEVICES env var to hide the 1080 Ti and force the model only onto the GV100. I am wholly unconvinced that the issues I am experiencing are related to absolute VRAM quantities; a 9 GB compute graph + 15 GB of weights still leaves me with 8 GB for context on just the single GPU, and 19 GB across both. Even with whatever OS overhead etc., this seems like ample space.
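
For completeness, that workaround can be scripted; a minimal sketch (assuming CUDA device 0 is the GV100, as in the ggml_cuda_init output above) that starts the server with only that device visible:

```go
// Workaround sketch: launch "ollama serve" with only device 0 (the GV100)
// visible, so nothing gets split onto the 1080 Ti.
package main

import (
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("ollama", "serve")
	cmd.Env = append(os.Environ(), "CUDA_VISIBLE_DEVICES=0") // hide device 1 (the 1080 Ti)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```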

@Panican-Whyasker commented on GitHub (Sep 16, 2025):

Just filed yet another RAM-related issue (bug): "Too Much RAM Eaten in Ollama 0.11.11" #12305.

Models that previously fit in the 6 GB of VRAM of the Quadro RTX 3000 now eat e.g. >20 GB of system RAM.
A 42 GB model (on disk) eats ~92 GB of system RAM.

Reference: github-starred/ollama#8164