[GH-ISSUE #10231] Mistral-small3.1:24b Not Fully Utilizing A10 GPU #68770

Open
opened 2026-05-04 15:07:58 -05:00 by GiteaMirror · 6 comments

Originally created by @talsan74 on GitHub (Apr 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10231

I’m running Ollama on an A10 GPU and noticed that models like mistral-small3.1:24b are only utilizing around 12GB of GPU memory, even though the card has 24GB available.

**Docker Command Used**

```
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run mistral-small3.1:24b
```
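As a point of reference, the numbers in reports like this can be gathered from inside the container with standard commands (a minimal sketch; the container name `ollama` matches the `docker run` above):

```
# GPU memory actually allocated on the card
docker exec -it ollama nvidia-smi

# Ollama's own view: loaded model size and the CPU/GPU split
docker exec -it ollama ollama ps
```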

The same command was used for gemma3.

mistral-small3.1:24b

![Image](https://github.com/user-attachments/assets/2db5ae6b-9a67-4270-9f00-82d08d3cfe4c)

In comparison, when I run gemma3:27b, it utilizes 100% of the GPU, which seems like expected behavior:

![Image](https://github.com/user-attachments/assets/f7ad7f37-51ff-4ef2-b2f8-ca65b47829ad)

**Actual Behavior**

While running the model, GPU usage seems to peak at around 12GB for mistral-small3.1:24b. I'm unsure if:

- this is expected behavior,
- it's a bug, or
- additional configuration is needed to make full use of the GPU memory.
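One way to tell how much of the model actually landed on the GPU (a minimal sketch; the exact log wording can vary between Ollama versions) is to search the server log for the layer-offload summary printed when the model loads:

```
# The runner logs how many layers were placed on the GPU at load time.
# A fully offloaded model reports something like "offloaded 41/41 layers to GPU";
# a smaller first number means part of the model is running on the CPU.
docker logs ollama 2>&1 | grep -i "offloaded"
```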

**Expected Behavior**

The mistral-small3.1:24b model should utilize most or all of the available GPU memory (similar to how gemma3:27b does) to fully leverage the A10's 24GB of VRAM.

**System Info**

Instance type (AWS): g5.2xlarge
CPU: 8 vCPUs
Memory: 32 GB
GPU: NVIDIA A10 (1 unit)
Ollama version: 0.6.5


@max40in commented on GitHub (Apr 12, 2025):

Same problem here. gemma3 loads fine (the 20%/80% split is because there isn't enough video memory to load it 100% onto the GPU, but that's not my problem):
![Image](https://github.com/user-attachments/assets/23ea91f2-fe78-414f-945d-ef78a290f63e)
The problem is with mistral-small3.1, which runs 100% on the CPU and doesn't use the GPU at all:
![Image](https://github.com/user-attachments/assets/ea6deabc-5f3a-4664-a183-257a20eab9de)
![Image](https://github.com/user-attachments/assets/9c4c3413-d201-41e5-a985-1a4737d5cc21)
I didn't change any settings; everything is default.
AMD Ryzen 5 2600 (6 cores/12 threads)
64 GB RAM
GPU:
2x RTX 3060 Ti
1x GTX 1070 Ti
Ollama version: 0.6.5
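With a mixed multi-GPU setup like this, one thing worth trying (an assumption on my part, not verified on this exact setup) is restricting Ollama to the identical cards via `CUDA_VISIBLE_DEVICES`, which Ollama honors:

```
# Expose only the two 3060 Ti cards to the server (the GPU indices 0,1 are
# hypothetical; check the real ordering with `nvidia-smi -L` first).
CUDA_VISIBLE_DEVICES=0,1 ollama serve
```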


@Cryztalzone commented on GitHub (Apr 12, 2025):

Same problem here. Gemma3:27B uses 15GB, with 4.5GB spilling to RAM, as seen here:
![Image](https://github.com/user-attachments/assets/6aeec2b6-c3d6-4b11-a216-c314279ddee5)
Mistral-Small3.1:24B uses only 10GB with 1.6GB in RAM, although I'm not entirely sure if this is even related to Ollama:
![Image](https://github.com/user-attachments/assets/c5b24d65-5fbf-40fd-9043-fbc3b1d059a2)

Also, `ollama ps` shows the running Mistral model as 26GB while Gemma is only 21GB; however, the files themselves are the exact opposite: Mistral is 15GB while Gemma is 17GB.
![Image](https://github.com/user-attachments/assets/543a8570-24c0-4274-992f-fcbed57144e8)
![Image](https://github.com/user-attachments/assets/c9a76235-5f41-462f-8540-835a0ac680b1)

Ollama 0.6.5
Ryzen 5800X
64GB RAM
RX 9070 XT on Windows (using libs from https://github.com/likelovewant/ollama-for-amd; since I'm on a modified installation I wasn't sure if that's the problem, but it seems to also occur on NVIDIA setups)
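One way to dig into a size discrepancy like this (a sketch; nothing here is specific to the bug) is to compare what Ollama reports for each model's metadata and baked-in defaults:

```
# Print model metadata: architecture, parameter count, quantization, context length
ollama show mistral-small3.1:24b
ollama show gemma3:27b

# Print the Modelfile, including any default parameters shipped with the model
ollama show --modelfile mistral-small3.1:24b
```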


@Cryztalzone commented on GitHub (Apr 12, 2025):

Using the same model and quantization (Mistral-Small-3.1-24B-2503-Q4_K_M) imported from Hugging Face, the usage is 94% GPU and 6% CPU, which is close to what's expected for 16GB of VRAM. Also, the running model is now listed as 15GB rather than 26GB:
![Image](https://github.com/user-attachments/assets/bd44c195-f492-455f-80c0-b3fdf45b12db)
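For anyone wanting to reproduce this workaround: a GGUF downloaded from Hugging Face can be imported with a minimal Modelfile (a sketch; the local filename and the model name `mistral-small3.1-hf` are made up here):

```
# Modelfile: import a local GGUF file
FROM ./Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M.gguf
```

```
ollama create mistral-small3.1-hf -f Modelfile
ollama run mistral-small3.1-hf
```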


@mbeltagy commented on GitHub (Apr 13, 2025):

I have the same problem on my 4090.

`nvidia-smi`:
![Image](https://github.com/user-attachments/assets/a5b985e9-c6ce-44fd-9f51-48bc0cd6dafd)

`ollama ps`:
![Image](https://github.com/user-attachments/assets/75640148-468b-441f-ab5e-2818d5c365f0)


@RubenMercadePrieto commented on GitHub (Apr 14, 2025):

Same issue here with an RTX 4090. Mistral-small3 is perfectly fine, but 3.1 is 26GB and needs to use the CPU... :(


@RubenMercadePrieto commented on GitHub (Apr 14, 2025):

While the GPU problem is still there, I can say that using `/set parameter num_gpu 41`, as suggested elsewhere, at least raises the eval rate from 14 to 45 tokens/s...
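For reference, the same override can be made persistent instead of retyped each session, either baked into a derived model or passed per request through the API (a sketch; the derived model name `mistral-small3.1-gpu` is made up):

```
# Modelfile: derive a model with the layer-offload override baked in
FROM mistral-small3.1:24b
PARAMETER num_gpu 41
```

```
ollama create mistral-small3.1-gpu -f Modelfile
```

Or per request via the REST API:

```
curl http://localhost:11434/api/generate -d '{
  "model": "mistral-small3.1:24b",
  "prompt": "Hello",
  "options": { "num_gpu": 41 }
}'
```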


Reference: github-starred/ollama#68770