[GH-ISSUE #10615] mistral-small3.1:24b q4 uses 100% CPU when num_ctx is changed to 128K #53495

Closed
opened 2026-04-29 03:24:56 -05:00 by GiteaMirror · 7 comments

Originally created by @Seraphli on GitHub (May 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10615

### What is the issue?

With the default num_ctx:

```
NAME                                         ID              SIZE     PROCESSOR    UNTIL
mistral-small3.1:24b-instruct-2503-q4_K_M    b9aaf0c2586a    29 GB    100% GPU     4 minutes from now
```

`ollama show mistral-small3.1:24b-instruct-2503-q4_K_M`

```
  Model
    architecture        mistral3
    parameters          24.0B
    context length      131072
    embedding length    5120
    quantization        Q4_K_M

  Capabilities
    completion
    vision
    tools

  Parameters
    num_ctx    4096

  System
    You are Mistral Small 3.1, a Large Language Model (LLM) created by Mistral AI, a French startup
      headquartered in Paris.
    You power an AI assistant called Le Chat.
```

Then I change `num_ctx` to 128K:

```
>>> /set parameter num_ctx 131072
Set parameter 'num_ctx' to '131072'
```

The model now runs fully on CPU:

```
NAME                                         ID              SIZE     PROCESSOR    UNTIL
mistral-small3.1:24b-instruct-2503-q4_K_M    b9aaf0c2586a    36 GB    100% CPU     4 minutes from now
```

But 2x 3090 (48 GB total) should be enough to hold a 36 GB model, right?

```
Thu May  8 10:39:46 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:21:00.0  On |                  N/A |
|  0%   31C    P8              8W /  350W |       4MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:49:00.0 Off |                  N/A |
|  0%   32C    P8             14W /  350W |       4MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
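
For what it's worth, the same context size can also be requested per call through the HTTP API instead of the interactive `/set parameter` command. A minimal sketch, assuming a local Ollama server on the default port 11434 (the `/api/generate` endpoint and `options.num_ctx` field are standard Ollama API; the prompt is just a placeholder):

```python
# Minimal sketch: request a 128K context via the Ollama HTTP API instead of
# the interactive `/set parameter num_ctx` command. Assumes a local server
# on the default port 11434.
import json
import urllib.request

payload = {
    "model": "mistral-small3.1:24b-instruct-2503-q4_K_M",
    "prompt": "Who are you?",
    "stream": False,
    "options": {"num_ctx": 131072},  # triggers the same reload/offload decision
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

Running `ollama ps` afterwards should show the same 100% CPU placement as the interactive session above.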

### Relevant log output

```shell
```

### OS

_No response_

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.6.8

GiteaMirror added the bug label 2026-04-29 03:24:56 -05:00

@rick-github commented on GitHub (May 8, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@Seraphli commented on GitHub (May 8, 2025):

[ollama_slog.txt](https://github.com/user-attachments/files/20103968/ollama_slog.txt)

Commands used to produce this log:

```
~ ❯ ollama run mistral-small3.1:24b-instruct-2503-q4_K_M
>>> /set parameter num_ctx 131072
Set parameter 'num_ctx' to '131072'
>>> Who are you?
I am an AI assistant created by Mistral AI, a leading AI company based in Paris. I'm here to help answer your questions, provide information, and assist you with
various tasks to the best of my ability. I don't have personal experiences, feelings, or a physical presence, but I'm designed to process and generate text based
on the data I've been trained on (up to 2023).

>>> Send a message (/? for help)
```

@rick-github commented on GitHub (May 8, 2025):

```
May 08 20:22:10 ollama[18751]: time=2025-05-08T20:22:10.813+08:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[8.4 GiB 23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="33.9 GiB" memory.required.partial="0 B" memory.required.kv="20.0 GiB" memory.required.allocations="[0 B 0 B]" memory.weights.total="13.1 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="360.0 MiB" memory.graph.full="13.3 GiB" memory.graph.partial="13.3 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
```

31.7 GiB is available across the two GPUs. The KV cache requires 20 GiB, the compute graph 13.3 GiB, and the projector roughly 9.5 GiB, and some of these data structures have to be duplicated across devices. That leaves no memory for even a single layer of weights, so the whole model runs in RAM.
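
The 20 GiB KV figure checks out with simple arithmetic. A back-of-the-envelope sketch, assuming Mistral Small 3.1's architecture is 40 repeating layers with 8 KV heads of dimension 128 and an f16 cache (these architecture numbers are my assumptions; only the 20 GiB total comes from the log above):

```python
# Back-of-the-envelope KV-cache size for mistral-small3.1:24b at num_ctx=131072.
# Architecture numbers below are assumptions (Mistral Small 3.1: 40 layers,
# 8 KV heads, head_dim 128) with an f16 (2-byte) cache; only the 20 GiB
# result is confirmed by the log's memory.required.kv="20.0 GiB".
n_layers   = 40      # repeating transformer blocks
n_kv_heads = 8       # grouped-query attention KV heads
head_dim   = 128
bytes_el   = 2       # f16
num_ctx    = 131072

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_el  # K and V
total = per_token * num_ctx
print(f"{per_token / 1024:.0f} KiB/token, {total / 2**30:.1f} GiB total")
# -> 160 KiB/token, 20.0 GiB total
```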


@Seraphli commented on GitHub (May 9, 2025):

I forgot to clear the GPUs during the previous test. Here is a new test with the GPUs cleared.

```
May 09 08:13:45 ollama[18751]: time=2025-05-09T08:13:45.498+08:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[23.3 GiB 23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="33.9 GiB" memory.required.partial="0 B" memory.required.kv="20.0 GiB" memory.required.allocations="[0 B 0 B]" memory.weights.total="13.1 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="360.0 MiB" memory.graph.full="13.3 GiB" memory.graph.partial="13.3 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
```

And may I ask roughly how much VRAM is needed to run this on two GPUs? Does the KV cache need to be duplicated across devices?
Another thing: even after clearing the GPUs, the available memory shows 23.3 GiB rather than 24 GiB.


@Seraphli commented on GitHub (May 9, 2025):

Here is another test with a 64K context length.

```
May 09 08:44:25 ollama[18751]: time=2025-05-09T08:44:25.908+08:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=37 layers.split=10,27 memory.available="[23.3 GiB 23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="48.0 GiB" memory.required.partial="45.9 GiB" memory.required.kv="10.0 GiB" memory.required.allocations="[22.9 GiB 23.0 GiB]" memory.weights.total="13.1 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="360.0 MiB" memory.graph.full="6.7 GiB" memory.graph.partial="6.7 GiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
```

Is `memory.required.allocations="[22.9 GiB 23.0 GiB]"` the memory that ends up being used? Because this is the VRAM usage I actually see:

![VRAM usage screenshot](https://github.com/user-attachments/assets/00a8facb-8b9a-4501-b6b3-7f4c9948a212)

The GPUs' VRAM is not fully utilized.
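
Since these `msg=offload` lines keep coming up, here is a throwaway sketch for turning one into a dict so runs at different `num_ctx` values are easier to diff side by side (purely illustrative, not part of Ollama):

```python
# Throwaway helper: parse an Ollama `msg=offload` log line into a dict so
# different runs (4K vs 64K vs 128K num_ctx) are easy to compare.
import shlex

def parse_offload(line: str) -> dict:
    # shlex honors the quoted values like memory.available="[23.3 GiB 23.3 GiB]"
    fields = {}
    for token in shlex.split(line):
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

line = ('msg=offload library=cuda layers.offload=37 layers.split=10,27 '
        'memory.available="[23.3 GiB 23.3 GiB]" '
        'memory.required.allocations="[22.9 GiB 23.0 GiB]"')
for key, value in parse_offload(line).items():
    print(f"{key:32} {value}")
```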


@pihapi commented on GitHub (Jun 8, 2025):

2x RTX 3060 = 24 GB VRAM, plus 68 GB RAM. This is the maximum I have.

**num_ctx = 4096**

```
PS C:\Users\u> ollama ps
NAME                                         ID              SIZE     PROCESSOR          UNTIL
mistral-small3.1:24b-instruct-2503-q4_K_M    b9aaf0c2586a    33 GB    39%/61% CPU/GPU    24 hours from now
```

**num_ctx = 32768**

```
PS C:\Users\u> ollama ps
NAME                                         ID              SIZE     PROCESSOR    UNTIL
mistral-small3.1:24b-instruct-2503-q4_K_M    b9aaf0c2586a    51 GB    100% CPU     24 hours from now
```

After changing num_ctx, GPU memory usage dropped to zero and everything was loaded into system RAM; the GPUs were left idle.

**num_ctx = 131072**

```
time=2025-06-08T17:48:11.705+05:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: exit status 2"
```


@Seraphli commented on GitHub (Jun 8, 2025):

Anyway, in my testing the maximum context length at which `mistral-small3.1:24b-q4` still fits on my GPUs is about 56K. Hopefully this saves someone time.
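
That ~56K figure can be re-checked mechanically. A sketch that loads the model at a given `num_ctx` and reports whether it landed fully in VRAM, using the standard `/api/generate` and `/api/ps` endpoints (the model name, probe values, and pass/fail criterion are choices for this issue, not anything official, and each probe reloads the model, so it is slow):

```python
# Sketch: probe which num_ctx values still let the model load 100% on GPU.
# Uses the standard Ollama endpoints /api/generate (to force a load at a
# given num_ctx) and /api/ps (which reports size vs. size_vram per model).
import json
import urllib.request

MODEL = "mistral-small3.1:24b-instruct-2503-q4_K_M"
BASE = "http://localhost:11434"

def post(path, payload):
    req = urllib.request.Request(BASE + path, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read())

def fully_on_gpu(num_ctx):
    # Force a (re)load at this context size, then inspect placement.
    post("/api/generate", {"model": MODEL, "prompt": "hi", "stream": False,
                           "options": {"num_ctx": num_ctx}})
    ps = json.loads(urllib.request.urlopen(BASE + "/api/ps").read())
    m = next(m for m in ps["models"] if m["name"] == MODEL)
    return m["size_vram"] >= m["size"]  # no bytes spilled to system RAM

for num_ctx in (49152, 53248, 57344, 61440, 65536):  # bracketing ~56K
    print(num_ctx, "GPU" if fully_on_gpu(num_ctx) else "CPU/partial")
```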

Reference: github-starred/ollama#53495