[GH-ISSUE #10181] NewLlamaServer failed - model requires more system memory for gemma3:12b #32440

Closed
opened 2026-04-22 13:44:03 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @metal3d on GitHub (Apr 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10181

What is the issue?

Hello,

Using 4 GTX 1070 cards in a rig, with 8 GB of VRAM each, gemma3:12b says that I don't have enough memory, while it works on my personal computer with an RTX 3090 with 24 GB of VRAM (same distribution, Fedora 41, up to date).

I use podman to launch the container, and there is no problem with many other models like mistral. Both machines run the same service and the same distribution; only the cards are different.

It's weird that the model asks for 55 GB...

Also, Deepseek-R1 is completely offloaded to CPU/RAM and doesn't use any GPU, while it is fine on my personal computer.

Relevant log output

time=2025-04-08T13:25:12.277Z level=INFO source=types.go:130 msg="inference compute" id=GPU-e819d891-c990-0584-e0d8-41429af4cbe4 library=cuda variant=v12 compute=6.1 driver=12.8 name="NVIDIA GeForce GTX 1070 Ti" total="7.9 GiB" available="7.7 GiB"
time=2025-04-08T13:25:12.277Z level=INFO source=types.go:130 msg="inference compute" id=GPU-f81dc979-ed5b-0f01-06c2-f84499b1ce23 library=cuda variant=v12 compute=6.1 driver=12.8 name="NVIDIA GeForce GTX 1070 Ti" total="7.9 GiB" available="7.8 GiB"
time=2025-04-08T13:25:12.277Z level=INFO source=types.go:130 msg="inference compute" id=GPU-c9558a4f-b11e-86e0-72e0-27a5b69a43b9 library=cuda variant=v12 compute=6.1 driver=12.8 name="NVIDIA GeForce GTX 1070 Ti" total="7.9 GiB" available="7.8 GiB"
time=2025-04-08T13:25:12.277Z level=INFO source=types.go:130 msg="inference compute" id=GPU-75b67a11-f677-4b16-0606-3893f7128daf library=cuda variant=v12 compute=6.1 driver=12.8 name="NVIDIA GeForce GTX 1070 Ti" total="7.9 GiB" available="7.8 GiB"
[GIN] 2025/04/08 - 13:28:07 | 200 |     917.493µs |       10.89.0.8 | GET      "/api/tags"
[GIN] 2025/04/08 - 13:28:07 | 200 |      95.343µs |       10.89.0.8 | GET      "/api/version"
[GIN] 2025/04/08 - 13:58:32 | 200 |    2.885402ms |       10.89.0.8 | GET      "/api/tags"
[GIN] 2025/04/08 - 13:58:33 | 200 |      66.771µs |       10.89.0.8 | GET      "/api/version"
[GIN] 2025/04/08 - 14:01:53 | 200 |    2.932734ms |       10.89.0.8 | GET      "/api/tags"
[GIN] 2025/04/08 - 14:01:53 | 200 |     117.932µs |       10.89.0.8 | GET      "/api/version"
time=2025-04-08T14:02:16.886Z level=INFO source=server.go:105 msg="system memory" total="31.3 GiB" free="28.4 GiB" free_swap="8.0 GiB"
time=2025-04-08T14:02:16.889Z level=WARN source=server.go:133 msg="model request too large for system" requested="55.2 GiB" available=39117021184 total="31.3 GiB" free="28.4 GiB" swap="8.0 GiB"
time=2025-04-08T14:02:16.889Z level=INFO source=sched.go:430 msg="NewLlamaServer failed" model=/opt/models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de error="model requires more system memory (55.2 GiB) than is available (36.4 GiB)"
[GIN] 2025/04/08 - 14:02:16 | 500 |  1.043771754s |       10.89.0.8 | POST     "/api/chat"

OS

Fedora 41, using podman

GPU

4x GTX 1070 with 8 GB VRAM

Ollama version

0.6.5

GiteaMirror added the bug label 2026-04-22 13:44:03 -05:00
Author
Owner

@rick-github commented on GitHub (Apr 8, 2025):

When a model is shared across multiple devices, the amount of overhead goes up. That is, a model loaded into a GPU consists of weights, context buffer, computation graph, projector data structures, etc. Some of those allocations need to be replicated across all devices, so multiple copies of those allocations increase the overall VRAM requirement.
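
To get a feel for the numbers, here is a rough back-of-the-envelope KV-cache estimate (a minimal sketch; the layer/head counts are illustrative placeholders, not verified gemma3:12b values, and the real model's attention layout may differ):

```shell
# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x context x bytes/element.
# All architecture numbers below are illustrative assumptions, not exact gemma3:12b values.
layers=48; kv_heads=8; head_dim=256; ctx=131072; bytes=2   # bytes=2 for an fp16 cache
echo "$(( 2 * layers * kv_heads * head_dim * ctx * bytes / (1024 * 1024 * 1024) )) GiB"
```

With those placeholder numbers, the cache alone comes to about 48 GiB at a 131072-token context, the same order of magnitude as the 55.2 GiB in the log once weights and compute buffers are added, but only about 3 GiB at an 8096-token context.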

Author
Owner

@metal3d commented on GitHub (Apr 8, 2025):

@rick-github that is a good lead.

OK, I ran some more tests, and I think something is not clear to me.

  • Using "ollama run gemma3:12b" works: I can chat with the model without any problem.
  • In "open-webui", I set the context window to 131072 (which is the context size I see in the ollama show output) and it fails with the error above.
  • If I set 8096, it works!

What I think, @rick-github, is that the context window cannot fit on the single GPU that has to absorb the inputs. So, whatever the total amount of memory, it cannot work. Should I divide the context window size by the number of GPUs?

Thanks a lot!

EDIT: the question is: what does ollama run do to make it work, while open-webui fails? Does ollama calculate the context window size being allocated?
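
As a side note, ollama ps reports each loaded model's memory footprint and CPU/GPU split, which is a quick way to check how a model actually landed (the exact output columns vary across versions):

```shell
# Show loaded models with their size and CPU/GPU split
# (the PROCESSOR column reports e.g. "100% GPU" or a CPU/GPU ratio).
ollama ps
```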

Author
Owner

@rick-github commented on GitHub (Apr 8, 2025):

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size
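
The linked FAQ entry boils down to overriding num_ctx per request. A minimal sketch against a local Ollama instance (the 8192 value and the prompt are just examples):

```shell
# Override the context window for a single request via the REST API.
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "Why is the sky blue?",
  "options": { "num_ctx": 8192 }
}'
```

In an interactive ollama run session, the equivalent is /set parameter num_ctx 8192.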
Author
Owner

@metal3d commented on GitHub (Apr 8, 2025):

Oh, it's set to 2048 by default! OK... So everything is clear now.

I can close the issue. Sorry for that. And thanks a lot.
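
For anyone hitting the same thing: per the same FAQ, a larger default context can be persisted by baking num_ctx into a derived model with a Modelfile (a sketch; the gemma3-8k name is arbitrary):

```shell
# Create a variant of gemma3:12b whose default context window is 8192 tokens.
cat > Modelfile <<'EOF'
FROM gemma3:12b
PARAMETER num_ctx 8192
EOF
ollama create gemma3-8k -f Modelfile
```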
