Hacker News

Python's `__hash__` is used primarily for hash-map bucketing. In that context, the identity of an integer is a perfectly cromulent bucketing value.
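A quick CPython sketch of what "identity" means here: small ints hash to themselves, so computing the hash is essentially free.

```python
# CPython hashes small integers to themselves.
print(hash(1))    # 1
print(hash(42))   # 42

# The one wrinkle: -1 is reserved as an error sentinel in the C-level
# hash protocol, so hash(-1) is -2 instead.
print(hash(-1))   # -2
```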


If your keys are multiples of the table size then all your keys will hash to the same bucket, in a naive hash table at least.
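To illustrate the pathological case: a naive table that buckets by `hash % table_size` puts every key that is a multiple of the table size into bucket 0.

```python
# Naive hash table: bucket index = hash(key) % table_size.
table_size = 8
keys = [0, 8, 16, 24, 32]          # all multiples of the table size

buckets = [hash(k) % table_size for k in keys]
print(buckets)                     # [0, 0, 0, 0, 0] -- every key collides
```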


More explicitly, since a bucket in Python's set implementation is the bottom N bits of the hash (where N depends on the set's size), a linear sequence of ints (the most common kind of int set) lands in all-different buckets, so zero collisions. The exception is when the differences between the integers in the set are multiples of a power of 2. But even then, the way set is implemented, you don't see performance start to regress until the stride is around 2**20 or so. If everything in the set is a multiple of 2**20, identity hashing starts to underperform a better-distributed hash, but it never gets more than about twice as slow.

So, it ends up being a matter of optimizing for the common case, but still being reasonably performant for the worst case.
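A small sketch of the masking behavior described above (using an illustrative 8-bucket table; CPython's actual table grows with the set):

```python
# CPython sets index buckets by the bottom N bits of the hash.
N = 3                       # 2**3 = 8 buckets, a toy table size
mask = (1 << N) - 1

# A linear run of ints occupies distinct buckets: zero collisions.
run = range(8)
print(sorted(hash(k) & mask for k in run))   # [0, 1, 2, 3, 4, 5, 6, 7]

# Ints spaced by a power of two share the same low bits and collide.
spaced = [k * 8 for k in range(4)]
print([hash(k) & mask for k in spaced])      # [0, 0, 0, 0]
```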



