Hash tables are nice & all, and it's very convenient to use them,
BUT I really regret how they create propensity for junior developers to use them even in cases where they are not really needed and more memory-efficient (and perf too) data structure could be used.
Anecdotally, in my recent year of interviewing new candidates, I keep asking this simple question: come up with the most efficient data structure to store highly sparse data (say vector), that will need to be access only sequentially. E.g., we have a 1M int32 values vector, and only 10K values are non-zero.
90% of the candidates suggest using the hash table (especially if they prefer to use Python for the interview).
I then ask them to estimate (roughly) the size of the memory, required to store those 10K values.
Some say they need 10KB (they struggle to convert int32 to the count of bytes as well ;( ).
Some say 40KB (a bit better, but they forget about keys).
Less than 20% arrive to 80KB.
Very few suggest that it's something higher than 80KB...
Most struggle to account for the pointers that inevitably should be there if values are not allocated in contiguous memory. Most forget about the hashmap itself.
Fwiw, here's some primitive comparison of memory taken by Python's dict, list and numpy's array:
The trap with asking these kind of questions is that you have spent a long time thinking of your optimum solution and the arguments for and against it.
You now have an expectation that it should be obvious and easy.
This solution is/should be obvious and easy to anyone who knows algorithms and data structures.
I didn't invent op's question, but an answer that it should be a list is obvious right after reading the question. The question is very similar to the most basic exercises from the very first lessons of any basic algorithmics course.
If I was programming in python, I'd still probably use a hashmap because it's the quickest to implement. But once it shows to be a bottleneck in terms of speed or memory use, I'd switch to lists.
I was thinking some simple structure to deal with repeated values when I saw the 10k / a million are different than, but I feel the person you are replying too is making a universal point that so many interviewers are convinced their problem is obvious and any decent dev should do it "their way". The data struct also should come with some use case about it, what else is it doing than 'storing' and enumerating results.
Two lists one with keys and with values. The key list can be delta encoded and then compressed with the assumption that most of the deltas are small (even with adversarial keys this property must hold, as if there are too many big deltas then it would overflow the key type but we know all the keys are valid).
> come up with the most efficient data structure to store highly sparse data (say vector), that will need to be access only sequentially.
There's something even more efficient -- a sparse array. I worked in the sparse linear algebra space, and you can gain a lot from sparsity. Add this to your fiddle:
EDIT: looks like I was wrong -- I made a mistake in the code. The sparse structure is actually larger.
Also in my earlier result of 280 bytes, the get_obj_size might be reading the metadata part of the data structure. 10k int32 objects (4 bytes) each will not compress to 280 bytes.
But my point in general holds -- sparse structures are usually more efficient to work with than dense structures, especially when you have really large matrices.
Sorry, you have a sorted list of 10,000 Int32, so each of them needs around 32 bits, for a total of 40 KB. I can see how you can store them in 80 KB. I could also see how you would only store the distance to the next one (say 32-log_2(10,0000) = 20 bits on average), for a total of 200,000 bits or 25 KB, if you manage to package it very efficiently.
But how do you store that info in 264 bytes? Something is off there.
To validate your point, I think you need to modify my test and create a 1M elements array with 99% of zeros - then your sparse class will be way more efficient, also hiding the complexity of creating a compact sparse representation at the native level.
Anecdotally, in my recent year of interviewing new candidates, I keep asking this simple question: come up with the most efficient data structure to store highly sparse data (say vector), that will need to be access only sequentially. E.g., we have a 1M int32 values vector, and only 10K values are non-zero.
90% of the candidates suggest using the hash table (especially if they prefer to use Python for the interview).
I then ask them to estimate (roughly) the size of the memory, required to store those 10K values.
Some say they need 10KB (they struggle to convert int32 to the count of bytes as well ;( ). Some say 40KB (a bit better, but they forget about keys). Less than 20% arrive to 80KB. Very few suggest that it's something higher than 80KB...
Most struggle to account for the pointers that inevitably should be there if values are not allocated in contiguous memory. Most forget about the hashmap itself.
Fwiw, here's some primitive comparison of memory taken by Python's dict, list and numpy's array:
10000 values dict size: 500568
10000 values list size: 426516
10000 values array size: 80056
500KB/80KB - > 6x times overhead...