[GH-ISSUE #11702] gpt-oss:120b take 99GB vram #69804

Closed
opened 2026-05-04 19:24:33 -05:00 by GiteaMirror · 6 comments

Originally created by @HuChundong on GitHub (Aug 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11702

What is the issue?

ollama ps:
gpt-oss:120b d98fe6ba01e6 99 GB 10%/90% CPU/GPU 8192 Forever

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_CONTEXT_LENGTH=8192"

How can I reduce VRAM usage? I have 4x 2080 Ti 22GB and hope all layers can run on the GPUs.

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response
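
A rough way to see why the reported 99 GB spills onto the CPU is to compare it with the combined VRAM of the cards. The sketch below assumes that "22GB" means 22 GB per card and ignores per-GPU overhead and GB/GiB rounding. The function name `vram_budget` is illustrative, not part of Ollama.

```python
# Figures taken from the report above; 22 GB per card is an assumption.
GPU_COUNT = 4
VRAM_PER_GPU_GB = 22
REPORTED_MODEL_GB = 99


def vram_budget(gpu_count, vram_per_gpu_gb, model_gb):
    """Return (total VRAM, overflow beyond VRAM, fraction that must live off-GPU)."""
    total = gpu_count * vram_per_gpu_gb
    overflow = max(0, model_gb - total)
    cpu_share = overflow / model_gb
    return total, overflow, cpu_share


total_gb, overflow_gb, cpu_share = vram_budget(
    GPU_COUNT, VRAM_PER_GPU_GB, REPORTED_MODEL_GB
)
print(f"total VRAM: {total_gb} GB, overflow: {overflow_gb} GB, CPU share: {cpu_share:.0%}")
```

Under these assumptions the overflow is about 11 GB, or roughly 11% of the reported size, which is in line with the 10%/90% CPU/GPU split shown by `ollama ps`.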

GiteaMirror added the bug label 2026-05-04 19:24:33 -05:00

@pdevine commented on GitHub (Aug 6, 2025):

I think we have this fixed in 0.11.3, but we don't have a 4x2080 in the potato farm to test this. We were over-allocating memory for the graph when the model couldn't fit on a single device. Can you test out the pre-release? Assets are here: https://github.com/ollama/ollama/releases/tag/v0.11.3-rc0

@maglat commented on GitHub (Aug 6, 2025):

> I think we have this fixed in 0.11.3, but we don't have a 4x2080 in the potato farm to test this. We were over allocating the amount of memory for the graph when we couldn't fit on a single device. Can you test out the pre-release? Assets are here (https://github.com/ollama/ollama/releases/tag/v0.11.3-rc0)

0.11.3 fixed the issue for me on 20B. Now I can run it with 60k context on my single RTX 5090; before, only 32k context was possible.
However, this faulty memory allocation is not new; it has been in Ollama for several releases. Could you please look into Mistral-Small 3.2 as well? The memory allocation seems off there too.

@HuChundong commented on GitHub (Aug 6, 2025):

Testing with llama.cpp confirms that four 2080 Ti cards can fully load gpt-oss:120b. I will test Ollama later.

@HuChundong commented on GitHub (Aug 6, 2025):

0.11.3 is OK, but startup is very slow: more than 3 minutes.
ollama ps:
gpt-oss:120b d98fe6ba01e6 86 GB 100% GPU 8192 Forever
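
To compare the two `ollama ps` rows quoted in this thread, a small parser can pull out the size and the GPU share. This is only a sketch: the column order (name, id, size, processor, context, until) is inferred from those two rows rather than from Ollama's documentation, and `parse_ps_row` is a hypothetical helper.

```python
import re

# Assumed row shape, inferred from the two rows in this thread:
# NAME  ID  SIZE  PROCESSOR  CONTEXT  UNTIL
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<id>[0-9a-f]+)\s+(?P<size>\d+(?:\.\d+)?)\s+GB\s+"
    r"(?P<processor>.+?)\s+(?P<context>\d+)\s+(?P<until>.+)$"
)


def parse_ps_row(line):
    """Parse one data row of `ollama ps` output into a dict (hypothetical helper)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    processor = m["processor"]
    split = re.match(r"(\d+)%/(\d+)%\s+CPU/GPU$", processor)
    if split:
        gpu_pct = int(split.group(2))  # second number is the GPU share
    elif processor == "100% GPU":
        gpu_pct = 100
    else:
        gpu_pct = None  # layout not seen in this thread
    return {
        "name": m["name"],
        "size_gb": float(m["size"]),
        "gpu_pct": gpu_pct,
        "context": int(m["context"]),
        "until": m["until"],
    }


before = parse_ps_row("gpt-oss:120b d98fe6ba01e6 99 GB 10%/90% CPU/GPU 8192 Forever")
after = parse_ps_row("gpt-oss:120b d98fe6ba01e6 86 GB 100% GPU 8192 Forever")
print(f"size reduced by {before['size_gb'] - after['size_gb']:.0f} GB, "
      f"GPU share {before['gpu_pct']}% -> {after['gpu_pct']}%")
```

With the two rows from this issue, the reported size drops by 13 GB and the model moves from a 90% GPU share to fully on GPU at the same 8192 context.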

@HuChundong commented on GitHub (Aug 6, 2025):

Speed is slower than llama.cpp.
ollama avg = 32
llama.cpp avg = 40
Both at 8k ctx.
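
The gap between the two averages can be expressed as a relative slowdown. The sketch below takes the two numbers as reported and assumes they are in the same unit, which the comment does not state. `relative_slowdown` is an illustrative name.

```python
# Averages as reported in the comment above; the unit is not stated there.
OLLAMA_AVG = 32
LLAMA_CPP_AVG = 40


def relative_slowdown(candidate, baseline):
    """Fraction by which `candidate` throughput falls short of `baseline`."""
    return (baseline - candidate) / baseline


slowdown = relative_slowdown(OLLAMA_AVG, LLAMA_CPP_AVG)
print(f"ollama is {slowdown:.0%} slower than llama.cpp at 8k ctx")
```

On these figures Ollama runs about 20% slower than llama.cpp at the same context length.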

@pdevine commented on GitHub (Aug 6, 2025):

@HuChundong we've got a number of fixes coming to improve the tps.

@maglat #11090 has the memory size fixes in it which should improve all models running on the new engine.


Reference: github-starred/ollama#69804