[GH-ISSUE #15302] Vulkan runtime: allow more than 64 GB VRAM with split memory heaps. #35550

Closed
opened 2026-04-22 20:07:14 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @rjmalagon on GitHub (Apr 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15302

On Linux with AMDGPU, when more than 64 GB of GTT memory is assigned, Vulkan splits the memory heap. Ollama counts the whole assigned GTT memory pool, but gets stuck when usage exceeds the larger of the split memory heaps.

My current setup: a GFX900-level AMD APU with 96 GB of main RAM, 82 GB of it assigned as GTT (amdgpu.gttsize=82000).

These are the memory heaps reported by `vulkaninfo`:

```
VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 2
        memoryHeaps[0]:
                size   = 29376905216 (0x6d7000000) (27.36 GiB)
                budget = 29367087104 (0x6d66a3000) (27.35 GiB)
                usage  = 0 (0x00000000) (0.00 B)
                flags:
                        None
        memoryHeaps[1]:
                size   = 58753810432 (0xdae000000) (54.72 GiB)
                budget = 58734174208 (0xdacd46000) (54.70 GiB)
                usage  = 0 (0x00000000) (0.00 B)
                flags: count = 1
                        MEMORY_HEAP_DEVICE_LOCAL_BIT
```
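The 82.1 GiB total in the Ollama log below matches the sum of these two heaps, which suggests the runtime adds heap sizes together even though no single allocation can exceed the largest heap. A minimal sketch of the mismatch (heap sizes copied from the `vulkaninfo` output; the summing behavior is an assumption inferred from the log, not confirmed from Ollama's source):

```python
# Heap sizes reported by vulkaninfo, in bytes.
heaps = [29376905216, 58753810432]

GIB = 1024 ** 3
total = sum(heaps)    # what Ollama appears to report as "total"
largest = max(heaps)  # the real ceiling for any single heap

print(f"total   = {total / GIB:.1f} GiB")    # ~82.1 GiB, matches the log
print(f"largest = {largest / GIB:.1f} GiB")  # ~54.7 GiB, the limit actually hit
```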

This is what the Ollama Vulkan runtime detects:

```
time=2026-04-03T19:23:01.730Z level=INFO source=types.go:42 msg="inference compute" id=00000000-0900-0000-0000-000000000000 filter_id="" library=Vulkan compute=0.0 name=Vulkan1 description="AMD Radeon Vega 8 Graphics (RADV RAVEN)" libdirs=ollama,vulkan driver=0.0 pci_id=0000:09:00.0 type=iGPU total="82.1 GiB" available="82.1 GiB"
```

When the larger heap is filled beyond capacity (above 54.7 GiB in this example, with Qwen3.5 27B bf16), Ollama does not crash; it just gets stuck indefinitely while loading the model.

```
time=2026-04-03T19:05:34.815Z level=INFO source=device.go:240 msg="model weights" device=Vulkan1 size="49.3 GiB"
time=2026-04-03T19:05:34.815Z level=INFO source=device.go:245 msg="model weights" device=CPU size="2.5 GiB"
time=2026-04-03T19:05:34.815Z level=INFO source=device.go:251 msg="kv cache" device=Vulkan1 size="4.7 GiB"
time=2026-04-03T19:05:34.832Z level=INFO source=device.go:262 msg="compute graph" device=Vulkan1 size="789.3 MiB"
time=2026-04-03T19:05:34.832Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="39.6 MiB"
time=2026-04-03T19:05:34.833Z level=INFO source=device.go:272 msg="total memory" size="57.2 GiB"
```
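Adding up the Vulkan1 allocations in this log shows why loading stalls: the per-device total slightly exceeds the 54.70 GiB budget of the larger heap, while staying well under the 82.1 GiB figure Ollama uses for planning. A quick check (all sizes copied from the log and `vulkaninfo` output above):

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Vulkan1 allocations from the log: weights + kv cache + compute graph.
vulkan1 = 49.3 * GIB + 4.7 * GIB + 789.3 * MIB
heap_budget = 58734174208  # memoryHeaps[1] budget from vulkaninfo

print(f"{vulkan1 / GIB:.2f} GiB requested")       # ~54.77 GiB on Vulkan1
print(f"{heap_budget / GIB:.2f} GiB in one heap")  # ~54.70 GiB available
print(vulkan1 > heap_budget)  # True: the load cannot fit in a single heap
```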
GiteaMirror added the feature request label 2026-04-22 20:07:14 -05:00
Author
Owner

@rjmalagon commented on GitHub (Apr 13, 2026):

Not needed. I was missing the forest for the trees.

TL;DR: on the current AMDGPU Linux module, it is better to set the shared memory via the `ttm.pages_limit` boot kernel parameter; the driver will then automatically size the corresponding GTT shared memory pool (there is no need to set `amdgpu.gttsize` and manually try to match TTM with GTT).

Citing an AMD article:

"TTM limits are expressed in 4 KB pages. To compute the value:

([size in GB] * 1024 * 1024) / 4.096

Example for 120 GB:

(120 * 1024 * 1024) / 4.096 = 30720000"

`ttm.pages_limit=30720000`

This gives Vulkan full access to the RAM for compute on these older AMD APUs.
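The AMD formula as written can be checked in a few lines. Note that it evaluates to 30720000 for the 120 GB example; the 96 GB figure below is an assumption for the machine described in this issue, not a value from the quoted article:

```python
def ttm_pages_limit(size_gb: float) -> int:
    """AMD's formula: TTM limits are expressed in 4 KB pages."""
    return round(size_gb * 1024 * 1024 / 4.096)

print(ttm_pages_limit(120))  # 30720000, the article's 120 GB example
print(ttm_pages_limit(96))   # 24576000, for the 96 GB RAM machine here
```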


Reference: github-starred/ollama#35550