Hacker Timesnew | past | comments | ask | show | jobs | submitlogin



> come up with the most efficient data structure to store highly sparse data (say vector), that will need to be access only sequentially.

There's something even more efficient -- a sparse array. I worked in the sparse linear algebra space, and you can gain a lot from sparsity. Add this to your fiddle:

    from scipy.sparse import csc_array
    sp_array = csc_array(my_list, dtype=np.int32)
    print(SZ, " values sparse array size: ", get_obj_size(sp_array))
And you'll get this result:

    10000  values dict size:  501060
    10000  values list size:  426620
    10000  values array size:  80056
    10000  values sparse array size:  80276
EDIT: looks like I was wrong -- I made a mistake in the code. The sparse structure is actually larger.

Also in my earlier result of 280 bytes, the get_obj_size might be reading the metadata part of the data structure. 10k int32 objects (4 bytes) each will not compress to 280 bytes.

But my point in general holds -- sparse structures are usually more efficient to work with than dense structures, especially when you have really large matrices.


Sorry, you have a sorted list of 10,000 Int32, so each of them needs around 32 bits, for a total of 40 KB. I can see how you can store them in 80 KB. I could also see how you would only store the distance to the next one (say 32-log_2(10,0000) = 20 bits on average), for a total of 200,000 bits or 25 KB, if you manage to package it very efficiently.

But how do you store that info in 264 bytes? Something is off there.


Thanks for pointing that out. It was my mistake (corrected in parent post)


You put all the other bytes outside of python's heap :)


To validate your point, I think you need to modify my test and create a 1M elements array with 99% of zeros - then your sparse class will be way more efficient, also hiding the complexity of creating a compact sparse representation at the native level.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: