
Steps to build a fast, highly adaptive AVX-512 sorting algorithm in C:

- Clone fluxsort (https://github.com/scandum/fluxsort)

- Replace the partitioning code in flux_default_partition and flux_reverse_partition with the obvious AVX-512 version using a compare and two compress instructions

- If you're feeling ambitious, swap out the small array sorting, or incorporate crumsort's fulcrum partition for larger arrays.

I know why I haven't done this: my computer doesn't have AVX-512, and hardly anyone else I know has it either. Maybe a couple of Zen 4 owners. I'm less clear on why the tech giants are reinventing the wheel to make these sorting algorithms that don't even handle pre-sorted data, rather than working with some of the very high-quality open source stuff out there. Is adaptivity really considered that worthless?

Fluxsort makes this particularly simple because it gets great performance out of a stable out-of-place partition. It's a bit newer; maybe the authors weren't aware of this work or started before it was published. But these algorithms both use (fairly difficult) in-place partitioning code; why not slot that into the well-known pdqsort?



If you look at the history of the author's own sorting implementation (ipnsort), it started as an attempt to port fluxsort to Rust:

https://github.com/Voultapher/Presentations/blob/master/rust...

https://www.youtube.com/watch?v=3JZAQ4Gsl-g

It starts there; eventually the author abandoned the linked PR in favor of a better approach found with ipnsort:

https://github.com/rust-lang/rust/pull/100856#issuecomment-1...


You are right that the initial kick-off point was an attempt to port fluxsort for a stable sort in Rust, but that quickly turned out to be unfeasible because Rust implementations have much stricter safety requirements. The user can modify values during a comparison, can unwind out of the sort at any comparison point via a panic (exception), and the comparison function may not be a total order. Together these effects make most of the code in fluxsort unusable for Rust. I had been working on ipn_stable and ipn_unstable as two completely different implementations. ipnsort, previously ipn_unstable, started off with the pdqsort port that is the current Rust `slice::sort_unstable`, and from there I iterated and tried out different ideas.


Well, I know what you mean but "completely different" is potentially misleading here. The current ipnsort is using bidirectional merges developed for quadsort (the merging part of fluxsort) and the fulcrum partition from crumsort, also by Scandum (all credited in comments; check the source if you want to see more influences!). To balance things out, the strategy for using the namesake sorting networks is new to me: pick a few fixed sizes and handle the rest by rounding down then a few steps of insertion sorting.


I should clarify, I meant that ipn_stable and ipn_unstable were separate projects, albeit with some cross-over ideas.

I can't rule out that someone else had that idea for limiting the use of sorting networks; all I can say is that I came up with it myself. At the same time, a lot of the good ideas in ipnsort come from other people, and I try my best to credit them. As I commented before, my goal is not peak perf for integers at any cost. Here is a commit https://github.com/Voultapher/sort-research-rs/commit/d908fe... that improves perf by 7% across sizes for the random pattern, but it comes with a non-negligible binary size and compile-time impact, even for integers. And while I use heuristics to only apply sorting networks where sensible, they can guess wrong. This is a generic sort implementation with the goal of being a good all-round implementation fit for a standard library and its various uses.


"pick a few fixed sizes and handle the rest by rounding down then a few steps of insertion sorting."

I'm late to the party, but this sounds a lot like quadsort's small array handling:

Sort 4, 8, or 16 elements using unrolled parity merges, and handle the rest with insertion sorting.

An unrolled parity merge can be viewed as a stable sorting network. I never added unstable sorting networks to crumsort due to wanting to keep the code size low, and perhaps the mistaken idea that it would reduce adaptivity, as crumsort is likely to scramble partially sorted input.
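For readers unfamiliar with the term, a parity merge of two equal-length sorted runs can be sketched like this (a toy branchless version of the idea, not quadsort's actual code): one merge walks forward from the fronts picking the smaller element, a mirrored merge walks backward from the backs picking the larger, and for runs of length n the forward pass provably produces the smallest n elements while the backward pass produces the largest n, so together they fill the output exactly.

```c
// Toy branchless parity merge of two sorted runs of 4 into dest[0..7].
// The forward half writes dest[0..3], the backward half writes dest[4..7].
void parity_merge4(const int *left, const int *right, int *dest)
{
    const int *l = left,      *r = right;        /* forward read pointers  */
    const int *le = left + 3, *re = right + 3;   /* backward read pointers */
    int *d = dest, *de = dest + 7;

    for (int i = 0; i < 4; i++) {
        int take_l = *l <= *r;                   /* branchless selection */
        *d++ = take_l ? *l : *r;
        l += take_l; r += !take_l;

        int take_r = *le <= *re;                 /* mirror from the back */
        *de-- = take_r ? *re : *le;
        le -= !take_r; re -= take_r;
    }
}
```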


I'm not talking about ipnsort, which I think is great. It makes good use of existing work and doesn't use AVX at all. So extending it with AVX-512 partition code would also be a good project!


Sharing some personal thoughts:

Publishing a paper can be an imperfect, but useful 'expensive signal' that increases visibility. For vqsort, we initiated a collaboration with two universities who had published papers: KIT's ips4o for parallel sorting; Jena's fast AVX2 partition and 2D sorting network.

It needn't be AVX-512, but I'd consider AVX2 to be table stakes for any sorting algorithm. It's not clear that scalar techniques transfer to vector approaches (for example, multi-pivot partitioning did not help), and the speedup from AVX2 vs scalar is at least 3x.

It's not clear to me that special-casing already-sorted inputs is a net positive in practice outside of benchmarks. Sure, we can early-out as soon as a difference is found, but we're still paying for potentially an entire scan through the array, or a slowdown in partitioning (we found that even tracking the min/max had a noticeable cost). This can instead be done at the level of callers, who can then remember that this dataset is sorted and even skip the is-sorted check subsequently.
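The trade-off described above is easy to see in a sketch: the check is cheap on unsorted input because it exits at the first out-of-order pair, but on genuinely sorted data it costs a full O(n) scan before the sort can early-out.

```c
// Early-out is-sorted check: returns false at the first descent,
// but must scan the whole array when the input really is sorted.
#include <stdbool.h>
#include <stddef.h>

bool is_sorted_i32(const int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (a[i - 1] > a[i]) return false;   /* early out */
    return true;
}
```

A caller that knows a dataset is already sorted can cache that fact and skip both this scan and the sort, which is the caller-level approach being argued for.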

In-place seems helpful for benefiting from limited-size but faster memory such as L3 or even HBM.

Finally, why C? As you mention, AVX-512 is not ubiquitous, and we'd prefer portability over duplicating several thousand LOC for each instruction set. That's much easier (via Highway) in C++. Would a C-compatible FFI be enough?


pdqsort paper: https://arxiv.org/abs/2106.05123

fluxsort compiles in C++ just fine.

Pre-sorted data is just the simplest example of adaptivity. Partially-sorted input can't be tracked easily by the caller and I believe that something like merging a few sorted arrays by concatenating then sorting is pretty common. Fluxsort, glidesort, ipnsort, vergesort all have methods that allow for merging such patterns quickly. According to the benchmarks in the article, many are also doing better at low-cardinality data.
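The concatenate-then-sort pattern is exactly what natural-run detection exploits; as a toy illustration of the idea (my own sketch, not any of the named libraries' code), a sort can find the existing sorted runs and merge adjacent ones instead of starting from scratch:

```c
// Toy natural mergesort: detect sorted runs, merge adjacent runs until one
// run remains. A concatenation of k sorted arrays needs only merge passes.
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static size_t run_end(const int *a, size_t i, size_t n)
{
    while (i + 1 < n && a[i] <= a[i + 1]) i++;
    return i + 1;            /* one past the run starting at i */
}

/* Stable merge of a[lo..mid) and a[mid..hi) via a scratch buffer. */
static void merge(int *a, size_t lo, size_t mid, size_t hi, int *tmp)
{
    size_t i = lo, j = mid, k = 0;
    while (i < mid && j < hi) tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp, k * sizeof *a);
}

void natural_mergesort(int *a, size_t n)
{
    if (n < 2) return;
    int *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;
    while (run_end(a, 0, n) < n) {      /* until one run covers everything */
        size_t lo = 0;
        while (lo < n) {
            size_t mid = run_end(a, lo, n);
            if (mid >= n) break;
            size_t hi = run_end(a, mid, n);
            merge(a, lo, mid, hi, tmp);
            lo = hi;
        }
    }
    free(tmp);
}
```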

If you don't consider any of this adaptivity useful, it's very strange to compare to std::sort which does make some attempt to be adaptive. And strange that your documentation wouldn't mention the difference. But we all know why you're doing that. I suppose you'll keep your fingers in your ears regarding 32-bit radix sort, which has benchmarked at 1450MB/s for a million elements (2.77 ns/v) on M1: https://github.com/mlochbaum/rhsort/issues/2#issuecomment-15... .

The fulcrum partition from crumsort is an in-place adaptation of fluxsort's out-of-place partition, which is why I mentioned it. It's faster when things stop fitting in L2 in my testing.


> According to the benchmarks in the article, many are also doing better at low-cardinality data.

VQSort handles low cardinality quite well. Unfortunately only a few benchmarks in the article (not including that one) were re-run after our performance bug fix.

> it's very strange to compare to std::sort which does make some attempt to be adaptive.

We chose std::sort mainly because it seems to be widely used.

> But we all know why you're doing that. I suppose you'll keep your fingers in your ears regarding 32-bit radix sort

huh, I'm curious what's the background or context for this remark? FYI I published a fast 32-bit radix sort in 2010: https://arxiv.org/abs/1008.2849 The trend has been towards less and less memory bandwidth per core, and we are often seeing servers bottlenecked on that.
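For context on the bandwidth point: a textbook LSD radix sort for 32-bit keys (a generic sketch of the technique, not the linked paper's algorithm) makes four full read-plus-scatter passes over the data, which is why memory bandwidth tends to be its bottleneck:

```c
// Minimal LSD radix sort for uint32_t: four passes of 8-bit digits,
// each pass a counting sort (histogram, prefix sum, stable scatter).
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *buf = malloc(n * sizeof *buf);
    if (!buf) return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++)           /* histogram this digit */
            count[(a[i] >> shift) & 0xFF]++;
        size_t sum = 0;                          /* exclusive prefix sum */
        for (int d = 0; d < 256; d++) {
            size_t c = count[d];
            count[d] = sum;
            sum += c;
        }
        for (size_t i = 0; i < n; i++)           /* stable scatter */
            buf[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, buf, n * sizeof *a);
    }
    free(buf);
}
```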


AVX-512 is also good for merging and pattern detection. Merging in the case of a large input (rather than just opportunistically for presorted subsequences) is arguably the most interesting part: a big queueing and data movement problem.



