Often you can sort a u128 (or smaller integer) that is actually the key and a pointer or the key and an index into an array. You may find that a single pass to rearrange the original array to the sorted order is slower than multiple passes because of cache issues. I would certainly be interested in reading about this sort of thing if you find anything :)