Hash tables are nice & all, and it's very convenient to use them, BUT I really r...

iamflimflam1 · on Feb 10, 2024

The trap with asking these kind of questions is that you have spent a long time thinking of your optimum solution and the arguments for and against it.

You now have an expectation that it should be obvious and easy.

kolinko · on Feb 10, 2024

This solution is/should be obvious and easy to anyone who knows algorithms and data structures.

I didn't invent op's question, but an answer that it should be a list is obvious right after reading the question. The question is very similar to the most basic exercises from the very first lessons of any basic algorithmics course.

If I was programming in python, I'd still probably use a hashmap because it's the quickest to implement. But once it shows to be a bottleneck in terms of speed or memory use, I'd switch to lists.

NotSammyHagar · on Feb 13, 2024

I was thinking some simple structure to deal with repeated values when I saw the 10k / a million are different than, but I feel the person you are replying too is making a universal point that so many interviewers are convinced their problem is obvious and any decent dev should do it "their way". The data struct also should come with some use case about it, what else is it doing than 'storing' and enumerating results.

im3w1l · on Feb 10, 2024

Two lists one with keys and with values. The key list can be delta encoded and then compressed with the assumption that most of the deltas are small (even with adversarial keys this property must hold, as if there are too many big deltas then it would overflow the key type but we know all the keys are valid).

yalok · on Feb 10, 2024

(from https://python-fiddle.com/saved/lN9D7DOQl0bQ8sqnDA94?code=tr...)

wenc · on Feb 10, 2024

> come up with the most efficient data structure to store highly sparse data (say vector), that will need to be access only sequentially.

There's something even more efficient -- a sparse array. I worked in the sparse linear algebra space, and you can gain a lot from sparsity. Add this to your fiddle:

    from scipy.sparse import csc_array
    sp_array = csc_array(my_list, dtype=np.int32)
    print(SZ, " values sparse array size: ", get_obj_size(sp_array))

And you'll get this result:

    10000  values dict size:  501060
    10000  values list size:  426620
    10000  values array size:  80056
    10000  values sparse array size:  80276

EDIT: looks like I was wrong -- I made a mistake in the code. The sparse structure is actually larger.

Also in my earlier result of 280 bytes, the get_obj_size might be reading the metadata part of the data structure. 10k int32 objects (4 bytes) each will not compress to 280 bytes.

But my point in general holds -- sparse structures are usually more efficient to work with than dense structures, especially when you have really large matrices.

FabHK · on Feb 10, 2024

Sorry, you have a sorted list of 10,000 Int32, so each of them needs around 32 bits, for a total of 40 KB. I can see how you can store them in 80 KB. I could also see how you would only store the distance to the next one (say 32-log_2(10,0000) = 20 bits on average), for a total of 200,000 bits or 25 KB, if you manage to package it very efficiently.

But how do you store that info in 264 bytes? Something is off there.

wenc · on Feb 10, 2024

Thanks for pointing that out. It was my mistake (corrected in parent post)

anonymoushn · on Feb 10, 2024

You put all the other bytes outside of python's heap :)

yalok · on Feb 10, 2024

To validate your point, I think you need to modify my test and create a 1M elements array with 99% of zeros - then your sparse class will be way more efficient, also hiding the complexity of creating a compact sparse representation at the native level.

TheDudeMan · on Feb 10, 2024

OK but what is your point?

yalok · on Feb 10, 2024

That junior engineers tend to use hash map in situations where it’s way less efficient and is not necessary…