Hacker Timesnew | past | comments | ask | show | jobs | submitlogin

> The downside here, then is that if -1 and -2 are used as dict keys, they'll end up bucketed together.

Note that they'll only end up bucketed together in the underlying hashmap. The dictionary will still disambiguate the keys.

    >>> a = -1
    >>> b = -2
    >>> {a: a, b: b}
    {-1: -1, -2: -2}


Yep, absolutely. End programmers will never see that unless they go digging into the CPython code. It's otherwise invisible.


Not only bucketed together, which means they could still be disambiguated in C by the full hash, but they'll actually share the full hash, which means they'd have to jump back into Python code to check `__eq__` to disambiguate IIUC. That seems like a huge perf issue, is it not?


Without looking it up, I think that's right, but... that's just kind of inevitable with hash tables. Unless you have a small number of keys and get very lucky, it's exceedingly likely that you'll have multiple keys in the same bucket, and you'll have to use whatever language's `__eq__` equivalent mechanism.

A mitigating factor is that dict keys have to be hashable, which implies immutability, which generally implies a small number of native, fast types like str, numbers, or simple data structures like tuples. You could have some frozendict monstrosity as keys, but 1) that's not something you see often, and 2) don't do that. It you must, define a fast __eq__. For example, an object mapping to a database row might look like:

  def __eq__(self, other):
      if self.pk != other.pk:
          return False # Fail quickly
      # If that's true, only then do a deep comparison
      return self.tuple_of_values == other.tuple_of_values
But again, that's just not really something that's done.


Hash *keys* are 8 bytes (assume 64-bit machine). But the Set *buckets*[1] are a masked subset of that, starting at 3 bits and growing with the size of the set. Meaning if you store 16, 8, 32, 24, 0 all in the same set, they'll be in the same bucket (0b000)*. But, they'll still have different hash *keys*, which are stored in the hash table along with the object id. Then if you look up whether 40 is in that set, the C code will look in 0b000 bucket, see the other five values there, but note that each has a different hash *key*, since those are also stored in the table. So it'll know that 40 is not in the set without having to jump up into `__eq__`**.

[1] https://github.com/python/cpython/blob/main/Objects/setobjec...

* Note that the actual implementation will place bucket conflicts in the next available bucket, not the typical "overflow" linked list from CS 102, and does lookups in the same fashion. This approach makes better use of CPU caches.

** For `int` specifically, it's a C class and has no `__eq__` method. But the idea is the same if you have a class with member `i:int` and `__hash__(self): self.i`.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: