I was actually working on a blog post about this a few months ago, but never pub...

I was actually working on a blog post about this a few months ago, but never published it because I wasn't happy with it. Here it is, anyway: https://goo.gl/W1KZ2t

What I found in testing was that linear probing beat everything when it had a sufficiently good hash function and a sufficiently low load factor. The load factor was pretty important. Even when factoring in cache misses, page faults, and the cost of allocations, a linearly probed table with a low load factor always beat more space-efficient designs.

>I also tried to use gcc/clang vector extension to speed up probing, but it actually slows down for 30%! I guess I have to hand tune the SSE intrinsics and measure it carefully with IACA to get the optimal performance.

They're only advantageous when the hash is perfect and the data is in the cache. Otherwise the performance should be the same as regular code. To clarify what I meant by SSE with double hashing, I really meant search an entire cacheline before incrementing by the second hash value. SSE can be used for this in some cases but regular code works just as well.

>Deletion itself is quite complicated and deserves its own post.

That's why I prefer the linear variant; deletion is easy!