[GH-ISSUE #8270] Incorrect NUMA detection logic, fails for AMD Threadripper 1950X #31048

Open
opened 2026-04-22 11:10:19 -05:00 by GiteaMirror · 2 comments

Originally created by @lukedd on GitHub (Dec 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8270

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

On my AMD Threadripper 1950X CPU with NUMA mode enabled in the BIOS, ollama does not detect that I am running on a NUMA system due to flawed logic in its detection code here: https://github.com/ollama/ollama/blob/459d822b5188dba051e21dfd15b6552543a4bbcf/discover/cpu_common.go#L10-L24

I can "trick" ollama into detecting NUMA by setting up fake information in /sys/devices/system/cpu/cpu*/topology/physical_package_id using overlayfs, which gives me a ~20% speedup for CPU-only eval-rate (tested with gemma2:27b).

The problem with the logic is that it counts how many physical CPU packages are in the system, but my system has a single CPU package containing 2 dies, each with its own memory controller.
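For context, the package-counting heuristic boils down to something like the following (a paraphrased sketch, not the exact code at the link above; the function name is mine):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isNUMAByPackageCount sketches the heuristic described above: report NUMA
// only when more than one distinct physical_package_id appears across CPUs.
func isNUMAByPackageCount() bool {
	paths, _ := filepath.Glob("/sys/devices/system/cpu/cpu*/topology/physical_package_id")
	ids := map[string]struct{}{}
	for _, p := range paths {
		if b, err := os.ReadFile(p); err == nil {
			ids[strings.TrimSpace(string(b))] = struct{}{}
		}
	}
	// A 1950X exposes a single package containing two dies, so this sees
	// one ID and returns false even with two NUMA nodes online.
	return len(ids) > 1
}

func main() {
	fmt.Println("NUMA (package count):", isNUMAByPackageCount())
}
```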

A naïve fix would be to look at `die_id` rather than `physical_package_id`: this would work for me, but I fear there may exist other hardware with multiple dies sharing a single memory controller. Also, even on my system I can disable NUMA in the BIOS so that memory access appears to be uniform; under the hood this interleaves memory access across both NUMA nodes, so in that mode looking at `die_id` would give the wrong answer.

A better fix would be to look at the actual NUMA node information presented by the kernel under `/sys/devices/system/node`, e.g. on my system the file `/sys/devices/system/node/online` contains `0-1`, whereas on a uniform-memory system it contains `0`.
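A minimal sketch of that check (assuming Linux sysfs; the function name is mine):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isNUMAByOnlineNodes asks the kernel directly. The file
// /sys/devices/system/node/online holds a range list such as "0" (uniform)
// or "0-1" (two nodes); any '-' or ',' means more than one node is online.
func isNUMAByOnlineNodes() bool {
	b, err := os.ReadFile("/sys/devices/system/node/online")
	if err != nil {
		return false // no sysfs node info: assume uniform memory
	}
	return strings.ContainsAny(strings.TrimSpace(string(b)), "-,")
}

func main() {
	fmt.Println("NUMA (online nodes):", isNUMAByOnlineNodes())
}
```

This also gets the BIOS-interleaved case right: with NUMA disabled in the BIOS the kernel reports a single online node, so the check returns false.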

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.5.4

GiteaMirror added the bug label 2026-04-22 11:10:19 -05:00

@jmorganca commented on GitHub (Dec 30, 2024):

@dhiltgen possible to take a look?


@chhu commented on GitHub (Apr 25, 2025):

I'm also using a Threadripper with NUMA active and see an even larger boost: almost 80% faster with llama.cpp when running the same model with `--numa=distribute`. I started experimenting when the numbers did not add up: a test program I wrote in C++ measures 90 GByte/s, so I expect token output close to model size / realistic memory bandwidth, which llama.cpp was able to reproduce (pure CPU).
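For scale, that bandwidth bound works out roughly as follows (the model size here is an assumption for illustration, not from the comment): a quantized model of ~16 GB that must be streamed once per generated token at 90 GByte/s caps decode throughput near 90 / 16 ≈ 5.6 tokens/s, so a runner landing well below that is leaving memory bandwidth on the table.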
