(from https://python-fiddle.com/saved/lN9D7DOQl0bQ8sqnDA94?code=true&output=true...

wenc · on Feb 10, 2024

> come up with the most efficient data structure to store highly sparse data (say vector), that will need to be access only sequentially.

There's something even more efficient -- a sparse array. I worked in the sparse linear algebra space, and you can gain a lot from sparsity. Add this to your fiddle:

    from scipy.sparse import csc_array
    sp_array = csc_array(my_list, dtype=np.int32)
    print(SZ, " values sparse array size: ", get_obj_size(sp_array))

And you'll get this result:

    10000  values dict size:  501060
    10000  values list size:  426620
    10000  values array size:  80056
    10000  values sparse array size:  80276

EDIT: looks like I was wrong -- I made a mistake in the code. The sparse structure is actually larger.

Also in my earlier result of 280 bytes, the get_obj_size might be reading the metadata part of the data structure. 10k int32 objects (4 bytes) each will not compress to 280 bytes.

But my point in general holds -- sparse structures are usually more efficient to work with than dense structures, especially when you have really large matrices.

FabHK · on Feb 10, 2024

Sorry, you have a sorted list of 10,000 Int32, so each of them needs around 32 bits, for a total of 40 KB. I can see how you can store them in 80 KB. I could also see how you would only store the distance to the next one (say 32-log_2(10,0000) = 20 bits on average), for a total of 200,000 bits or 25 KB, if you manage to package it very efficiently.

But how do you store that info in 264 bytes? Something is off there.

wenc · on Feb 10, 2024

Thanks for pointing that out. It was my mistake (corrected in parent post)

anonymoushn · on Feb 10, 2024

You put all the other bytes outside of python's heap :)

yalok · on Feb 10, 2024

To validate your point, I think you need to modify my test and create a 1M elements array with 99% of zeros - then your sparse class will be way more efficient, also hiding the complexity of creating a compact sparse representation at the native level.