[GH-ISSUE #10479] Shared VRAM is always used when loading Qwen3:235b, resulting in severe performance downgrade #32652

Open
opened 2026-04-22 14:18:56 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @PC-DOS on GitHub (Apr 29, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10479

What is the issue?

Hi there,

I'm not sure if this is a bug related to Ollama, the Qwen3:235b model config, or NVIDIA/Windows' strange logic. When loading Qwen3:235b, shared VRAM is always occupied, causing severe performance degradation. Larger models (like DeepSeek-R1:671b (Q4) or its Q1.58/Q2.51 variants) don't have this problem.

When loading Qwen3:235b with `ollama run qwen3:235b --verbose`, about 20 GB of VRAM is occupied on each of my two RTX 3090s, and shared VRAM usage reaches 95 GB out of 256 GB. When testing a simple prompt (sending "你好", "Hello" in English), the eval rate drops to roughly 0.2 tok/s, which is unbearable. Setting CUDA - Sysmem Fallback Policy to "Prefer no fallback" in the NVIDIA Control Panel does not help.
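For anyone reproducing this, the CPU/GPU split Ollama actually settles on can be checked with `ollama ps` while the model is loaded. A sketch of what that looks like, with illustrative values rather than figures taken from the attached logs:

```
$ ollama ps
NAME          ID        SIZE      PROCESSOR          UNTIL
qwen3:235b    <id>      146 GB    74%/26% CPU/GPU    4 minutes from now
```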

Then I exported a copy of the Qwen3:235b Modelfile with `ollama show --modelfile qwen3:235b > Modelfile`, added a `PARAMETER num_gpu 0` line to disable GPU offloading, and created the model with `ollama create qwen3:235b-gpulimited`. The response rate then rises to about 3~4 tokens/s. The full sequence is sketched below.
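For reference, the whole workaround is just a few commands (the `echo` append is one way to add the parameter; editing the exported Modelfile in a text editor works equally well):

```
# Export the Modelfile of the installed model
ollama show --modelfile qwen3:235b > Modelfile

# Disable GPU offloading entirely by pinning zero layers to the GPUs
echo PARAMETER num_gpu 0 >> Modelfile

# Build a new tag from the edited Modelfile
ollama create qwen3:235b-gpulimited -f Modelfile

# Chat with the CPU-only variant and watch the eval rate
ollama run qwen3:235b-gpulimited --verbose
```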

CPU

2x AMD EPYC 9654

RAM

16x DDR5 RECC 4800MT/s

GPU

2x NVIDIA GeForce RTX 3090 24GB

OS

Windows Server 2022, Version 10.0.20348.2700

Server Log

Default: [server-qwen3-235b-default.log](https://github.com/user-attachments/files/19964865/server-qwen3-235b-default.log)

No offloading: [server-qwen3-235b-nooffloading.log](https://github.com/user-attachments/files/19964864/server-qwen3-235b-nooffloading.log)

Chatting Log

Default: [chat-qwen3-235b-default.log](https://github.com/user-attachments/files/19964862/chat-qwen3-235b-default.log)

No offloading: [chat-qwen3-235b-nooffloading.log](https://github.com/user-attachments/files/19964863/chat-qwen3-235b-nooffloading.log)

Modelfile

Default: [Modelfile-qwen3-235b-default.txt](https://github.com/user-attachments/files/19964888/Modelfile-qwen3-235b-default.txt)

No offloading: [Modelfile-qwen3-235b-nooffloading.txt](https://github.com/user-attachments/files/19964889/Modelfile-qwen3-235b-nooffloading.txt)

Relevant log output

For server.log using qwen3:235b, please refer to server-qwen3-235b-default.log.

For chatting log using qwen3:235b, please refer to chat-qwen3-235b-default.log.

For server.log using qwen3:235b without GPU offloading, please refer to server-qwen3-235b-nooffloading.log.

For chatting log using qwen3:235b without GPU offloading, please refer to chat-qwen3-235b-nooffloading.log.

Model file of qwen3:235b is presented in Modelfile-qwen3-235b-default.txt.

Model file of qwen3:235b without GPU offloading is presented in Modelfile-qwen3-235b-nooffloading.txt.

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.6.6

GiteaMirror added the bug label 2026-04-22 14:18:56 -05:00
Author
Owner

@rick-github commented on GitHub (Apr 29, 2025):

ollama generally doesn't use large amounts of shared memory unless explicitly told to. Normally it's just a small amount of overflow during the inference process when transient memory allocations are required for short-lived buffers.

```
time=2025-04-30T00:53:53.924+08:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1
layers.model=95 layers.offload=26 layers.split=13,13 memory.available="[22.6 GiB 22.8 GiB]" memory.gpu_overhead="0 B"
memory.required.full="145.9 GiB" memory.required.partial="43.1 GiB" memory.required.kv="376.0 MiB"
memory.required.allocations="[21.5 GiB 21.5 GiB]" memory.weights.total="132.3 GiB" memory.weights.repeating="131.8 GiB"
memory.weights.nonrepeating="486.9 MiB" memory.graph.full="1002.7 MiB" memory.graph.partial="1002.7 MiB"

load_tensors: offloading 26 repeating layers to GPU
load_tensors: offloaded 26/95 layers to GPU
load_tensors:    CUDA_Host model buffer size = 97489.56 MiB
load_tensors:        CUDA0 model buffer size = 18201.04 MiB
load_tensors:        CUDA1 model buffer size = 19785.04 MiB
```

From the log, ollama has allocated a bit less than 20G to each GPU, and used 97G of system RAM to hold the rest of the model. There's nothing here that indicates ollama is using shared memory.
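(For the arithmetic: 18201.04 MiB + 19785.04 MiB ≈ 37.1 GiB across the two GPUs, and the 97489.56 MiB CUDA_Host buffer ≈ 95.2 GiB of system RAM.)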

However, you are seeing a performance degradation and you have a high thread count (384). There's a possibility that you are suffering from [thread contention](https://github.com/ollama/ollama/issues/10022#issuecomment-2761681481), although this usually affects CPU-only model loads and not hybrid loads. It's easy to check, though: just start a CLI chat session and run `/set parameter num_thread 192` and see if the performance changes.
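A minimal sketch of that check in the CLI (the confirmation line is approximate, and any prompt will do):

```
$ ollama run qwen3:235b
>>> /set parameter num_thread 192
Set parameter 'num_thread' to '192'
>>> 你好
```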

Author
Owner

@PC-DOS commented on GitHub (Apr 30, 2025):

> ollama generally doesn't use large amounts of shared memory unless explicitly told to. Normally it's just a small amount of overflow during the inference process when transient memory allocations are required for short-lived buffers.
>
> ```
> time=2025-04-30T00:53:53.924+08:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1
> layers.model=95 layers.offload=26 layers.split=13,13 memory.available="[22.6 GiB 22.8 GiB]" memory.gpu_overhead="0 B"
> memory.required.full="145.9 GiB" memory.required.partial="43.1 GiB" memory.required.kv="376.0 MiB"
> memory.required.allocations="[21.5 GiB 21.5 GiB]" memory.weights.total="132.3 GiB" memory.weights.repeating="131.8 GiB"
> memory.weights.nonrepeating="486.9 MiB" memory.graph.full="1002.7 MiB" memory.graph.partial="1002.7 MiB"
>
> load_tensors: offloading 26 repeating layers to GPU
> load_tensors: offloaded 26/95 layers to GPU
> load_tensors:    CUDA_Host model buffer size = 97489.56 MiB
> load_tensors:        CUDA0 model buffer size = 18201.04 MiB
> load_tensors:        CUDA1 model buffer size = 19785.04 MiB
> ```
>
> From the log, ollama has allocated a bit less than 20G to each GPU, and used 97G of system RAM to hold the rest of the model. There's nothing here that indicates ollama is using shared memory.
>
> However, you are seeing a performance degradation and you have a high thread count (384). There's a possibility that you are suffering from [thread contention](https://github.com/ollama/ollama/issues/10022#issuecomment-2761681481), although this usually affects CPU-only model loads and not hybrid loads. It's easy to check, though: just start a CLI chat session and run `/set parameter num_thread 192` and see if the performance changes.

Thanks for your kind support! I've tested changing `num_thread`; it helped, but the performance degradation is still very significant.

Qwen3:235b default config:

![Image](https://github.com/user-attachments/assets/84cab150-ec71-46fe-b9f2-5dab4c7d4910)

Qwen3:235b no offloading:

![Image](https://github.com/user-attachments/assets/0ca59898-ff01-48fb-8def-3e92c87d1c9f)

Also, there is notable shared VRAM usage with the Qwen3:235b default config.


Reference: github-starred/ollama#32652