Hacker Timesnew | past | comments | ask | show | jobs | submitlogin

I have been using Riak's secondary indexes for my latest project and have generally found them a joy to use. However, I do have to have to question a bit the way they are architected.

Assuming you are using Riak's default configuration each range query hits 1/3 of the cluster, which could get pretty hairy on large clusters that have lots of requests. Also, there is no pagination, so if an index has a million objects you'll have to be prepared to wait even if you only want the first part of the query.

You could solve this by putting a sort value in the key and using a range query, but this wouldn't work if you want the most recent items keyed with time, because the items could be unevenly spaced back in time. Also, Riak, like many databases based on Dynamo, thrives on fat data which one would think would favor lists. LevelDB is also supposedly slower than Bitcask, the default backend, but I'm not sure if this is still true.

I've been trying to think of ways around these problems. A simple thought I had was to simply cache the response as pages in Riak. Although this introduces new problems like how to know how often to reset the cache, too often and I may as well not have this cache, too infrequently and users get stale data. I would also have to handle this using worker threads because I wouldn't want the odd 100th user to get a big latency hit. The database would also either have to be continually polled, wasting CPU, or potentially not have the data cached when needed.

Another solution I've been considering is to write a secondary index layer on top of Riak using a skiplist or btree to know where to add and remove data when it gets to be very large. This seems like a cool idea, but might be tricky to implement and do conflict resolutions on.

My last idea was the most ambitious, which was to implement a separate distributed database specifically for secondary indexes and range queries which would not be bound by Dynamo. The idea here is to have each node in charge of a segment of the key space (like Big Table) and then have it split and coalesce not only based on size, but also on frequency of reads and writes to handle the bottleneck problem.

I initially was going to have this paired with the Dynamo database (https://github.com/dbunker/Dynago) I was experimenting with using Go and LevelDB, but there is no reason it couldn't work with any Key-Value eventually-consistent hyper-reliable database to provide light-weight secondary indexes. Having it constantly check the core key-value database would mean it wouldn't have to be super reliable in its own right and so could be kept relatively simple.

But again the simplest solution may ultimately be the way to go, I'm not sure, all these seem to have pretty big trade offs.



Assuming you are using Riak's default configuration each range query hits 1/3 of the cluster, which could get pretty hairy on large clusters that have lots of requests. Also, there is no pagination, so if an index has a million objects you'll have to be prepared to wait even if you only want the first part of the query.

You could solve this by putting a sort value in the key and using a range query, but this wouldn't work if you want the most recent items keyed with time, because the items could be unevenly spaced back in time.

Pagination is coming soon; it's in riak_kv master already, but in buyer-beware #yolo territory.

LevelDB is also supposedly slower than Bitcask, the default backend, but I'm not sure if this is still true.

Bitcask is faster when all the keys fit in memory: it's designed to load any value with a single disk seek. LevelDB can't make that guarantee, but neither can Bitcask with too many keys for available memory.

I've been trying to think of ways around these problems. A simple thought I had was to simply cache the response as pages in Riak. Although this introduces new problems like how to know how often to reset the cache, too often and I may as well not have this cache, too infrequently and users get stale data.

Caching is one of the two hard problems in software engineering (along with "naming things" and "off-by-one errors"), so good luck :) If you're not opposed to running a separate service, Memcache is what I'd use.


Any thoughts on when the pagination goes live? I can't find any information on it online. Memcache would be a good choice, but I am wondering, if I have a few secondary indexes with over a million indexes each, wouldn't continually recreating this cache irreparably bog down the cluster?


I believe it's part of Riak 1.4, which is our next release; no date yet.

The number of entries in a 2i isn't going to bog down querying it any more than lots of objects bog down LevelDB. Make sure your indexes have the right content with the right cardinalities and it shouldn't be a problem.

If you want to drop in to #riak on freenode tomorrow (I'm in the America/New_York time zone) I'm brycek in there.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: