Hacker Timesnew | past | comments | ask | show | jobs | submitlogin
Scientists find optimal space-time balance for hash tables (quantamagazine.org)
189 points by Brajeshwar on Feb 9, 2024 | hide | past | favorite | 63 comments


I'm a bit sad to see the anti-curious commentary here. This result is really cool, and finds the asymptotic sweet spot for hash table memory usage. Often times, the first algorithm to establish an asymptotic limit is entirely too complicated and doesn't help at human scale. If the world collectively ignores such results, as many here seem inclined to do, that's the end of the story. But when somebody continues bashing their head into this wall, sometimes a really good algorithm falls out, which hits that asymptotic performance benefit at a human-useful scale.

The thing I'm curious about (as I haven't had time to cozy up with the paper, and I'm about to run) is how they hit that runtime. What's the big idea? Or is it just the gestalt of a dozen slightly clever ideas?


The anti-curious commentary strikes me as being analogous to meth cooks complaining about https://en.wikipedia.org/wiki/High-pressure_chemistry research results not being applicable to their (1 bar) "real world" problems.

(Edit: it would appear that meth cooks are more appreciative of theoretical advances than programmers; recent european lab busts have revealed synthesis pathways using [similar to Haber-Bosch and Bergius] hundreds of atmospheres of pressure)


>Often times, the first algorithm to establish an asymptotic limit is entirely too complicated and doesn't help at human scale

https://en.wikipedia.org/wiki/Galactic_algorithm


I expected some cool graphics or livehacks but I see photos of researchers and it is not obvious what link I should follow to move from promotional content to more scientifical one.


How about searching for it in Google, instead of clicking a link?


For anyone curious this is about a 2023 paper proving that a 2022 theoretical hash table construction achieved not only an upper bound in performance in terms of time and space, the construction also achieved a lower bound. It’s a theoretical construction so not actually a practical one you’d find in your software toolkit (constants are too large for most applications, likely even realistic databases). The referenced “practical” IcebergHT data structure isn’t even one you’d likely encounter as it seems to be a research hash table focusing on persistent memory which isn’t hardware you typically would encounter.


IcebergHT isn't just for persistent memory (although I can see why you might think it is based on the paper's title). The paper also gives experiments showing that the hash table performs very well in standard RAM, much better than other concurrent implementations.


But the other implementations were against other PMEM hash tables unless I misread? Like no comparison against common c++ or rust implementations.


I think you may have a backwards. Libcuckoo, CLHT, and TBB are widely used high performance C/C++ DRAM hash tables. I think TBB is the hash table Intel maintains, if I remember right.

So the DRAM experiments are apples to apples. It's actually the PMEM experiments, I think, that are comparing a new hash table on a new technology to previous hash tables that weren't designed specifically for that technology.


You’re right. I didn’t see later in the paper where they compare against TBB.


This is so strange to me.

I'm hoping because Quanta's explanation was approachable, but ultimately, wrong when I try applying it the following way:

Theorists will spend an enormous amount of time developing algorithms that are O(kNlog(N)) that are impractical in practice because

- K approximates infinity

- It is well-known K approximates infinity.

- It is not expected for K to decrease.


> “But in practice, constants really matter,” he said. “In the real world, a factor of 10 is a game ender.”

Working at a place who kept losing customers to a competitor whose software was less than 2x as fast as ours fundamentally changed how I view optimization and how I view constant overhead C. And crystalized once I saw how delivering steady gains milestone after milestone can buy a lot more goodwill than one fast and dirty optimization.

Speed doesn't matter if you're the only game in town (a monopoly). For everything else it matters.


love when things like this are ready and waiting on the shelf for when we need them


Kinda but often times such research isn’t even “waiting on the shelf”. You can see that in the Iceberg paper which describes how earlier constructions optimized for PMEM Hw did a bad job because they were built against simulated hardware. In practice you need to actually iterate with real designs.

I doubt such designs will find practical uses (iceberg maybe but the pure math designs seem unlikely).


In 200 years this is going to be like when we discover yet another proof of Euler or Gauss, 20 years after a current mathematician arrived at the same result.


Or it could be like multipoint touch screens, early version of which were a solved problem in 1980s or earlier, and just needed compute power to catch up, which it did over subsequent decades.

Or it could be like code=data and homoiconity and other CS fundamentals that were figured out 40+ years ago, but are still mostly ignored by software industry/culture.


Hash tables are nice & all, and it's very convenient to use them, BUT I really regret how they create propensity for junior developers to use them even in cases where they are not really needed and more memory-efficient (and perf too) data structure could be used.

Anecdotally, in my recent year of interviewing new candidates, I keep asking this simple question: come up with the most efficient data structure to store highly sparse data (say vector), that will need to be access only sequentially. E.g., we have a 1M int32 values vector, and only 10K values are non-zero.

90% of the candidates suggest using the hash table (especially if they prefer to use Python for the interview).

I then ask them to estimate (roughly) the size of the memory, required to store those 10K values.

Some say they need 10KB (they struggle to convert int32 to the count of bytes as well ;( ). Some say 40KB (a bit better, but they forget about keys). Less than 20% arrive to 80KB. Very few suggest that it's something higher than 80KB...

Most struggle to account for the pointers that inevitably should be there if values are not allocated in contiguous memory. Most forget about the hashmap itself.

Fwiw, here's some primitive comparison of memory taken by Python's dict, list and numpy's array:

10000 values dict size: 500568

10000 values list size: 426516

10000 values array size: 80056

500KB/80KB - > 6x times overhead...


The trap with asking these kind of questions is that you have spent a long time thinking of your optimum solution and the arguments for and against it.

You now have an expectation that it should be obvious and easy.


This solution is/should be obvious and easy to anyone who knows algorithms and data structures.

I didn't invent op's question, but an answer that it should be a list is obvious right after reading the question. The question is very similar to the most basic exercises from the very first lessons of any basic algorithmics course.

If I was programming in python, I'd still probably use a hashmap because it's the quickest to implement. But once it shows to be a bottleneck in terms of speed or memory use, I'd switch to lists.


I was thinking some simple structure to deal with repeated values when I saw the 10k / a million are different than, but I feel the person you are replying too is making a universal point that so many interviewers are convinced their problem is obvious and any decent dev should do it "their way". The data struct also should come with some use case about it, what else is it doing than 'storing' and enumerating results.


Two lists one with keys and with values. The key list can be delta encoded and then compressed with the assumption that most of the deltas are small (even with adversarial keys this property must hold, as if there are too many big deltas then it would overflow the key type but we know all the keys are valid).



> come up with the most efficient data structure to store highly sparse data (say vector), that will need to be access only sequentially.

There's something even more efficient -- a sparse array. I worked in the sparse linear algebra space, and you can gain a lot from sparsity. Add this to your fiddle:

    from scipy.sparse import csc_array
    sp_array = csc_array(my_list, dtype=np.int32)
    print(SZ, " values sparse array size: ", get_obj_size(sp_array))
And you'll get this result:

    10000  values dict size:  501060
    10000  values list size:  426620
    10000  values array size:  80056
    10000  values sparse array size:  80276
EDIT: looks like I was wrong -- I made a mistake in the code. The sparse structure is actually larger.

Also in my earlier result of 280 bytes, the get_obj_size might be reading the metadata part of the data structure. 10k int32 objects (4 bytes) each will not compress to 280 bytes.

But my point in general holds -- sparse structures are usually more efficient to work with than dense structures, especially when you have really large matrices.


Sorry, you have a sorted list of 10,000 Int32, so each of them needs around 32 bits, for a total of 40 KB. I can see how you can store them in 80 KB. I could also see how you would only store the distance to the next one (say 32-log_2(10,0000) = 20 bits on average), for a total of 200,000 bits or 25 KB, if you manage to package it very efficiently.

But how do you store that info in 264 bytes? Something is off there.


Thanks for pointing that out. It was my mistake (corrected in parent post)


You put all the other bytes outside of python's heap :)


To validate your point, I think you need to modify my test and create a 1M elements array with 99% of zeros - then your sparse class will be way more efficient, also hiding the complexity of creating a compact sparse representation at the native level.


OK but what is your point?


That junior engineers tend to use hash map in situations where it’s way less efficient and is not necessary…


Somewhat off-topic but does anyone have a bookmarklet that would remove annoying scrolling behavior on sites like Quanta? I.e. pressing space bar should immediately scroll down a page, not do it slowly with awkward mechanics.

In exchange I'm happy to share a bookmarklet that "unsticks" sticky elements of the page (e.g. an ever-visible header like on WaPo, though they are not the worst offender): https://pastebin.com/Vh594168


I think they're using sscroll from npm, but not entirely sure. I was able to fix it, but too well, in that now it'll scroll with unbounded speed: https://tio.run/##LY5BDsIgEEX3PcXsgKQS9wYXRpMu3HkCAlMdbYFQ2q...


This is quick and dirty but it works on this particular site:

javascript:void(addEventListener('keydown',e=>e.keyCode==32&&e.stopPropagation(),true)


...spacebar?

shouldn't "down a page" be done with PageDown?

who came up with binding this behavior to a spacebar as well??


>shouldn't "down a page" be done with PageDown?

PageDown doesn't reliably exist - plenty of laptops put it behind a function-key combo, making the only action you're doing on the page a two-handed affair.

> who came up with binding this behavior to a spacebar as well??

Apparently, it comes from the `more` command[1], because they couldn't reliably assume terminals would have a PageDown key, and because spacebar was the biggest key on the keyboard so it was the obvious choice for the one action in `more`.

[1] https://ux.stackexchange.com/questions/53110/why-does-the-sp...


Spacebar is next page is old-as-ages terminal convention.

But it has no place in a browser.


Someone who realized that page down is far away, space is close, and paging down is frequent


There is no PageDown on my MacBook keyboard.


Fn-down gives you page down. Also:

Fn-up: page up

Fn-right: end

Fn-left: home


Thanks. Yeah, I forgot. Fn-up and Fn-left are particularly useful.


This kind of thing is unfortunately not useful at any scale.

Real computer systems have performance that varies due to memory locality and size.

Locality because of things such as cache hierarchy and even the size of cache lines and memory pages.

Size because of physical implementation: larger memories are physically bigger and hence further away. The speed of light makes access to larger memories inescapably slower.

The best current hashtable implementations are all quite far from the purely theoretical computer science optimums, but are faster despite this because they take these factors into account.

Back when CPUs were simple and had no virtual memory or caches, there was a good correspondence between theoretical CS algorithms and their real implementations.

Now? Everything I see published in this space is basically pure maths with little or no practical utility. It’s still interesting, sure, but it’s a bit sad that the theorists have retreated into a virtual world to escape the messy details of our reality.


This is a bit pessimistic.

Most STOC papers don't directly give an incredibly practical algorithm. But what they do give is insight, and that insight can often be used to improve the practical state of the art. And new bounds get you thinking; there's power in just knowing we can do better. It's like breaking the 4 minute mile -- a lot follows.

From a personal perspective: I quite like some of the techniques they use in the STOC'22 paper, and I have some ideas about how to turn that into a practical improvement for some of my data structures, which _do_ take into account CPU and cache because I'm that kind of geek. (To save you some googling, I'm one of the co-creators of the cuckoo filter, and of the techniques that underlie many practical implementations of cuckoo hashing, particularly those with concurrent access, and some other stuff. A lot of what I do in practical data structures-meets-systems comes from seeing what's happening in theory and finding ways to apply it, even though in doing so I often give up on the optimality bounds, the insights that come from the theory folks are the basis of it.)

Also, note that there are a few flavors of theory. A lot of US theory is quite on the math side of things, with a few exceptions (Mitz, quoted in the article, being one of them). But there's a very thriving applied algorithms community in the world, with a lot of energy in the area in both Europe and South America, and the stuff published in those venues -- again, often derived from or inspired by work such as this STOC paper -- is more immediately applicable.


Cuckoo hashing is such a joy to work with, and feels like a gem produced from the pressure at the intersection of theory and practice.

They are much easier to implement yet feel more fundamental than other open accessing shemes.

I've used tiny cuckoo hash tables with a bounded size of 2,4,...,256 elements as a compression technique for an Adaptive Radix Trie that doesn't require heterogeneous nodes, and I think I've never had more fun in my career as a computer scientist/software engineer.


Hey, if you're looking for a real-world pragmatic and performant implementation of a theoretically-cool algorithm, my https://github.com/tpn/perfecthash project might fit the bill.

It's geared to generating perfect hash tables with the fastest possible lookup/index times (for 32-bit keys), for key sets in the <=100,000 range. (It scales well up to millions of keys, but the solving time takes a lot longer.)


My hunch is that boomphf will outperform and also supports any key type: https://github.com/10XGenomics/rust-boomphf

Construction is ~10m keys/s on old hardware and uses very few bits per key.

Implementation is based on the bbhash paper: https://arxiv.org/abs/1702.03154


What’s its fastest index function look like in assembly? My MultiplyShiftRX clocks in at like 5 cycles on x64 and 3 cycles on my M1. Mine is optimized for offline table generation so construction speed isn’t really relevant for its primary use.


What is a theorist if not someone who works on interesting problems which have a tenuous-at-best relationship with reality?



For reference, I enjoy (and have formally studied) both theoretical physics and pure theoretical mathematics. I get the "value" of both, even when the value is extremely abstract and removed from practical utility.

However.

Theoretical physics especially is grounded in reality. Sure, it has its frictionless cows and whatnot, but generally a connection with the real world is maintained.[1] In other words, it's possible to take an idealised theory and then sprinkle the messy details on top, such as friction and air resistance.

What I'm seeing in computer science is different. Their theories are not the type that need a slight quantitative adjustment to match reality in the sense of adding a 1% extra fine-tuning factors, but they're qualitatively wrong. They're wrong not by constant factors or constant offsets, but big-O notation wrong. The equations have the wrong powers in them! Missing square roots or logarithms!

It's as if the working engineers had switched from Newtonian to Relativistic Mechanics because we've colonised the Solar System, but all theoretical physicists are basically pretending that only Newtonian mechanics is worthy of study and that Lorentzian mechanics is just a fudge factor that can be ignored forever and ever. Even when we all live in space with multi-hour communication delays caused by the speed of light. "Just set that to zero and then..."

[1] One notable exception to this that grinds my gears is a habit of publishing papers about results that only apply in non-real toy models, such as 2+2 dimensional spacetime, but not putting a big warning box before the abstract to warn journalists that nothing in the paper applies to our reality.


Yep, it's the same thing every time. "We propose this new thing with optimal time complexity blah blah blah" - don't care. Show benchmark. What do you mean it's slower than this basic but cache efficient hash table implemented in 100 lines of c++?


This is a totally different result. It’s providing a proof of an algorithmic complexity. So let’s say you have 10^18 items, it’s likely you’d be using this even if certain algorithms are faster at smaller numbers. It’s also important to note the particularly interesting result is how the mathematicians proved that the earlier design was optimal (upper and lower bound). The technique is the valuable part as it adds to the mathematician tool kit of how to prove such results. This article is about math papers / functional theoretic math, not about applies CS ideas.


I'm struggling to think of any possible application that would need anywhere remotely close to 10^18 items. Do you have an example?


I pulled that specific number out of the air and it was intentionally an over exaggeration. We don’t know the crossover point.


> let’s say you have 10^18 items

You won't. Not on a single node.

> The technique is the valuable part as it adds to the mathematician tool kit of how to prove such results.

Agreed.


I haven't kept up in this space, but there was a bunch of papers under "cache agnostic algorithms" that were actually taking the concept of cache into account. The cache agnostic here being regardless of cache size, sort of treating cache in the O() sense.

You could still beat them by tweaking for your hardware. But for the first time there was research into why Q sort is faster than heap sort


I think these are more-often called "cache oblivious" algorithms


Here is the code https://github.com/splatlab/iceberghashtable

(from https://prashantpandey.github.io/publication/sigmod23_iceber...)

Need to compare it against my other concurrent hash tables, as they measured only 64bit int performance for keys and values, which is a bit unrealistic. And they only used murmurhash.


What is the ratio. Tell me.


Second paper's lower bound has some strange "log(^U_n)" term. Can anyone explain what does that even mean? Logarithm of combinations, as in "log(U!/n!/(U-n)!)"?


What is an optimum in a two-objective (time, space) problem?

Also, how exactly does time help to save space? Is this about some sophisticated hash functions? Handling collisions?


optimal for which use cases?


In my opinion space limited solutions are not very interesting. Time complexity is always >= space complexity, so finding optimal time solutions is always more interesting. And we already knew the optimal time complexity. I appreciate a theoretical time vs space relation, but it doesn’t seem applicable to other areas.


For achieving speed, often you'd like to keep your hash table in RAM. Space limits result in speed trade-offs. This is very much a real-world concern when writing genome aligners. Hash-based mappers were at a disadvantage early on due to the memory constraints of most servers back around 2009-2012, thus the more efficient Burrows-Wheeler transform based algorithms became popular, partly due to significantly reduced RAM needs - one could now align genomes on a laptop!


It’s a great thing that this report addresses both space and time!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: