[GH-ISSUE #13315] Weird VRAM behavior on Ministral-3-3b #8795

Closed
opened 2026-04-12 21:34:11 -05:00 by GiteaMirror · 13 comments

Originally created by @HerzogVolpe on GitHub (Dec 3, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13315

What is the issue?

When starting to chat with the model, the VRAM usage of my RTX 3060 makes some odd jumps for a while and then shows as empty in Task Manager while the model is generating text. The model also takes very long to load.

[Image attachment]

Relevant log output

(none provided)

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.13.1
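
For anyone reproducing this, a quick way to see where a loaded model actually ended up is to compare its total size against the VRAM-resident portion reported by the server. A minimal sketch, assuming a default local install listening on http://localhost:11434 and that this Ollama version exposes the /api/ps endpoint with size and size_vram fields:

```python
import json
import urllib.request

# Query the local Ollama server for currently loaded models. Assumes the
# default listen address; adjust if OLLAMA_HOST points elsewhere.
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    data = json.load(resp)

for m in data.get("models", []):
    total = m.get("size", 0)        # total memory used by the model, in bytes
    vram = m.get("size_vram", 0)    # portion resident in GPU memory, in bytes
    print(f"{m['name']}: total={total / 2**30:.1f} GiB, "
          f"vram={vram / 2**30:.1f} GiB, cpu={(total - vram) / 2**30:.1f} GiB")
```

If size_vram stays near zero while Task Manager shows a brief VRAM spike during load, that matches the behavior described above: the model is considered for GPU placement but ends up running on the CPU.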

GiteaMirror added the bug label 2026-04-12 21:34:11 -05:00

@HerzogVolpe commented on GitHub (Dec 3, 2025):

By the way, this is not an urgent issue; I just want to provide some evidence in case you want or need it.


@metaligh commented on GitHub (Dec 3, 2025):

None of the Ministral models are working...


@HerzogVolpe commented on GitHub (Dec 3, 2025):

Ok? But it did answer and generate...


@metaligh commented on GitHub (Dec 3, 2025):

No. I have a blank answer.


@LaaZa commented on GitHub (Dec 3, 2025):

On Linux, for me they seem to try loading into VRAM but fail at each attempt because it somehow determines there isn't enough memory, and the models end up running 100% on CPU. The VRAM does spike while it tries to load but is left empty once it finally generates on CPU only.
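
One way to test whether this is the fit estimate rather than a genuine out-of-memory condition is to request an explicit GPU layer count and see whether the load still falls back to the CPU. A hedged sketch, assuming the standard num_gpu request option and a locally pulled ministral-3:3b tag; this is a diagnostic, not a confirmed workaround:

```python
import json
import urllib.request

# Ask Ollama to generate with an explicit GPU layer count instead of letting
# the scheduler estimate the fit. num_gpu is a standard request option; 99
# simply means "offload as many layers as the model has".
payload = {
    "model": "ministral-3:3b",
    "prompt": "Say hello.",
    "stream": False,
    "options": {"num_gpu": 99},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```

Checking `ollama ps` afterwards shows whether the layers actually landed on the GPU; if forcing the count triggers an allocation failure instead, real VRAM pressure (rather than the estimate) is the more likely cause.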


@dan-and commented on GitHub (Dec 3, 2025):

Currently, ministral-3 uses a good chunk of memory:
$ ollama ps
NAME               ID              SIZE     PROCESSOR    CONTEXT    UNTIL
ministral-3:14b    8a5cdca192c0    19 GB    100% GPU     4096       59 minutes from now
ministral-3:8b     77300ee7514e    16 GB    100% GPU     4096       59 minutes from now
ministral-3:3b     a48e77f25d79    13 GB    100% GPU     4096       57 minutes from now
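
For scale, a rough back-of-envelope for the 3B tag (assuming roughly 3 billion parameters at a Q4-class quantization and the 4096-token context shown above, with guessed layer and head counts) suggests it should need far less than 13 GB:

```python
# Rough, assumption-laden estimate for a ~3B-parameter model at Q4-class
# quantization with a 4096-token context. None of these figures are taken
# from the actual GGUF metadata; they are placeholders for the arithmetic.
params = 3e9                 # ~3 billion parameters (assumption)
bytes_per_param = 0.55       # ~4.5 bits/param effective for Q4_K-style quants
weights_gib = params * bytes_per_param / 2**30

# KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * 2 bytes (f16)
layers, kv_heads, head_dim, ctx = 26, 8, 128, 4096   # plausible guesses only
kv_gib = 2 * layers * kv_heads * head_dim * ctx * 2 / 2**30

print(f"weights ~{weights_gib:.1f} GiB, kv cache ~{kv_gib:.1f} GiB")
# roughly 1.5 GiB + 0.4 GiB, i.e. on the order of 2 GiB, not 13 GB
```

The gap between an estimate of that order and the 13 GB reported above is what later comments in the thread trace to the compute-graph reservation.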


@LaaZa commented on GitHub (Dec 3, 2025):

Yeah, that also seems out of proportion, but for me it does not even partially load on the GPU; it ends up on the CPU because apparently even the smallest one doesn't fit for some reason.


@dan-and commented on GitHub (Dec 3, 2025):

Yes, there may be an issue, which also occurs over multiple GPUs (see https://github.com/ollama/ollama/issues/13313 )


@tomhanax commented on GitHub (Dec 3, 2025):

Same here: ministral-3:3b runs 0% GPU / 100% CPU on a GeForce GTX 1060 6GB.
No problem with other "small" models (gemma, qwen, etc.); they all run well on the GPU.


@dhogenson commented on GitHub (Dec 3, 2025):

Kind of, for me: I'm on Linux, and when I try to load the ministral-3:3b model it tries to load into VRAM, but then decides there isn't enough VRAM (I have an RTX 4060) and loads it onto the CPU.


@dabe-19 commented on GitHub (Dec 4, 2025):

Comparing with the logs for the mistral-nemo:12b model, which had no issue loading all 41 layers into VRAM using the llama architecture, the ministral-3 models appear to use the mistral3 architecture. All three ministral-3 models seem to allocate 9.1 GiB of VRAM for the compute graph, leaving very little room for the weights on a 12 GB RTX 3060, which are then offloaded to the CPU.

Below are logs from running the 3B model; the same behavior was observed across the 14B, 8B, and 3B variants.

2025-12-03 21:39:36.014 | time=2025-12-04T03:39:36.014Z level=INFO source=sched.go:450 msg="gpu memory" id=GPU-fe5f89ae-25da-f6e1-80d7-f217cbb1a8ed library=CUDA available="10.0 GiB" free="10.5 GiB" minimum="457.0 MiB" overhead="0 B"
2025-12-03 21:39:36.014 | time=2025-12-04T03:39:36.014Z level=INFO source=server.go:702 msg="loading model" "model layers"=27 requested=-1
2025-12-03 21:39:36.015 | time=2025-12-04T03:39:36.015Z level=INFO source=runner.go:1271 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:12 GPULayers:11[ID:GPU-fe5f89ae-25da-f6e1-80d7-f217cbb1a8ed Layers:11(15..25)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
2025-12-03 21:39:36.424 | time=2025-12-04T03:39:36.423Z level=INFO source=runner.go:1271 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:12 GPULayers:10[ID:GPU-fe5f89ae-25da-f6e1-80d7-f217cbb1a8ed Layers:10(16..25)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
2025-12-03 21:39:36.798 | time=2025-12-04T03:39:36.798Z level=INFO source=runner.go:1271 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:12 GPULayers:10[ID:GPU-fe5f89ae-25da-f6e1-80d7-f217cbb1a8ed Layers:10(16..25)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=runner.go:1271 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:12 GPULayers:10[ID:GPU-fe5f89ae-25da-f6e1-80d7-f217cbb1a8ed Layers:10(16..25)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=ggml.go:482 msg="offloading 10 repeating layers to GPU"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=ggml.go:494 msg="offloaded 10/27 layers to GPU"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="668.7 MiB"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=device.go:245 msg="model weights" device=CPU size="2.4 GiB"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="160.0 MiB"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=device.go:256 msg="kv cache" device=CPU size="256.0 MiB"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="9.1 GiB"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="6.0 MiB"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=device.go:272 msg="total memory" size="12.6 GiB"
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=sched.go:517 msg="loaded runners" count=1
2025-12-03 21:39:37.479 | time=2025-12-04T03:39:37.479Z level=INFO source=server.go:1294 msg="waiting for llama runner to start responding"
2025-12-03 21:39:37.480 | time=2025-12-04T03:39:37.480Z level=INFO source=server.go:1328 msg="waiting for server to become available" status="llm server loading model"
2025-12-03 21:39:38.739 | time=2025-12-04T03:39:38.739Z level=INFO source=server.go:1332 msg="llama runner started in 7.63 seconds"
2025-12-03 21:39:38.739 | [GIN] 2025/12/04 - 03:39:38 | 200 |  7.827676741s |       127.0.0.1 | POST     "/api/generate"
2025-12-03 21:44:48.249 | ggml_backend_cuda_device_get_memory device GPU-fe5f89ae-25da-f6e1-80d7-f217cbb1a8ed utilizing NVML memory reporting free: 565321728 total: 12884901888
2025-12-03 21:44:51.504 | time=2025-12-04T03:44:51.503Z level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 41563"
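
To confirm which architecture and layer count a given tag reports (and so check the mistral3 vs. llama observation above), the metadata can be read back from the server. A minimal sketch, assuming the /api/show endpoint returns a model_info map keyed by the usual GGUF metadata names:

```python
import json
import urllib.request

# Fetch model metadata from the local Ollama server and print the reported
# architecture and block (layer) count. Key names are assumptions based on
# the usual GGUF metadata layout (general.architecture, <arch>.block_count).
payload = {"model": "ministral-3:3b"}
req = urllib.request.Request(
    "http://localhost:11434/api/show",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    info = json.load(resp)

arch = info.get("model_info", {}).get("general.architecture")
blocks = info.get("model_info", {}).get(f"{arch}.block_count")
print("architecture:", arch)
print("block count:", blocks)
print("parameter size:", info.get("details", {}).get("parameter_size"))
```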

@LaaZa commented on GitHub (Dec 5, 2025):

0.13.2-rc1 fixed this for me on Linux with CUDA; memory usage is also reasonable.


@metaligh commented on GitHub (Dec 7, 2025):

Does not work on Windows 11. The model outputs a lot of garbage to the console but does not load.

Reference: github-starred/ollama#8795