> Somehow, similarly bad numbers were achieved with 56 cores working with disjoint parts of 512 GB array
Oh, that's not NUMA at all, now that I'm more carefully reading your post. NUMA would involve a "copy" step, ensuring that those elements are in NUMA-local memory before reading.
Much like how in GPU programming you have to worry about the physicality of memory, in NUMA-aware programming you have to memcpy data to the right location before you get high speeds. Each of the 56 cores needs its ~10 GB in "NUMA-local" memory _BEFORE_ you start the benchmark.
Yeah, I realize this isn't practical. But... who ever said that NUMA use cases are practical? Lol. In a lot of cases it makes more sense to just take advantage of Infinity Fabric for simplicity (although it's slower, it's definitely more convenient).