Note: The performance values mentioned on that page (on an EC2 c1.medium instance using spinning-rust disks!) are wildly out of date. I'll get around to updating them some day.
For reference, on my laptop (Dell Latitude 7390 with an i7-8650U CPU):
* Bulk inserts run at ~600,000/second (up from 125,000).
* Bulk extracts run at ~660,000/second while in RAM (up from 30,000) and ~220,000/second from disk (up from 20,000).
* Bulk updates run at ~700,000/second in RAM dropping to ~300,000 from disk (up from 110,000 dropping to 60,000).
* Random reads run at ~800,000/second in RAM dropping to ~20,000 from disk (up from 220,000 dropping to 11,000).
* Random mixed run at ~400,000/second in RAM dropping to ~10,000 from disk (up from 30,000 dropping to 4,000).
* Hot-spot reads run at ~800,000/second in RAM dropping to ~500,000 from disk (up from 220,000 dropping to 60,000).
Note that "in RAM" means "the dataset fits into RAM" -- in all cases data is durably stored to disk.
There is almost certainly room for improvement; I haven't done extensive profiling yet.
> It was designed to satisfy the needs of the Tarsnap online backup service for high-performance key-value storage, although it is not yet being used for that purpose
Does the fact that you're still maintaining Kivaloo this many years later imply it's now being used by Tarsnap?
Tarsnap has recently started using Kivaloo. Over time I intend to use it far more -- but since Tarsnap is a backup service, I'm starting with the least critical parts first.
Out of curiosity, what was the use case/upstream requirement for building Kivaloo in the first place?
The Kivaloo page notes that it was designed for Tarsnap, but I'm curious as to the "why" - eg, an eventual-consistency/consensus building block, a simple local metadata service, server-side housekeeping...?
(I'm also curious what "{indexed ,}sorted files" means, and how UFS is significant.)
1. What are, off the top of your head, some design changes or code changes required that'd bring drastic performance improvements?
2. What are some key internals that you think differentiate Kivaloo from other embedded KV stores? I assume you must have gone through a lot of existing literature on the topic before building this. For example, LMDB, BDB, RocksDB, LevelDB, SQLite and the likes come to mind that can double up as KV stores.
3. Does it store the database in flat files with a WAL in front? Is the file format of the database custom, or based on existing formats?
4. Does the database auto index the fields? Or, use any other such aids to speed up access to data?
No, I mean with a dataset which is too large to fit into the amount of RAM on the system.
> 1. What are, off the top of your head, some design changes or code changes required that'd bring drastic performance improvements?
Nothing immediately comes to mind. Profiling may reveal some improvements, of course.
> 2. What are some key internals that you think differentiate Kivaloo from other embedded KV stores? I assume you must have gone through a lot of existing literature on the topic before building this. For example, LMDB, BDB, RocksDB, LevelDB, SQLite and the likes come to mind that can double up as KV stores.
Well... kivaloo isn't an embedded KV store, so that would be a big differentiating factor. It's a network daemon.
> 3. Does it store the database in flat files with a WAL in front? Is the file format of the database custom, or based on existing formats?
The "on-disk" format is the pages of an append-only B+Tree, with the last page being the tree root.
I put "on-disk" in scare quotes because there are other backends, e.g. using Amazon DynamoDB to store pages.
4. Does the database auto index the fields? Or, use any other such aids to speed up access to data?
There are no fields. Key-value pairs, nothing more.
You are correct, as things are at the present time. I designed kivaloo to be composable with the intention that I could put a replication/sharding layer in front of it later, however.
I am genuinely intrigued by how much it costs to commission those tests. They are so complete and thorough that I can't see myself being able to afford one for some project of mine, but I would absolutely like to have an estimate.
These are actually some really good numbers for a certain application I'm looking at (https://hackertimes.com/item?id=24191307), especially with bulk inserts being so high.
A few questions:
- Why 255 bytes to 255 bytes? Does this have any performance consequences?
- Your laptop is SSD right?
- Is there any more documentation on how to use Kivaloo? (not just facts about it)
The 255 byte limit is because I store keys and values as a one-byte length followed by the relevant data. For Tarsnap what I typically want is ~40 byte keys and data (hence using those sizes for benchmarking).
Yes, my laptop has an Intel 660p 512 GB NVMe disk.
No documentation per se, although I hope the library interfaces are reasonably understandable. I'd be happy to help though -- this code deserves to be used!
Is the one-byte length central to some logic (or some performance concerns), or could one hack the `uint8_t` to a `uint16_t` in kvldskey and so on to get something that could store a bit more? We have some systems that need only 16-32 byte keys but upwards of around 16k values, we're currently using RocksDB but these performance numbers + no compaction process + mux approach are making me interested...
Interesting question. You would need to adjust kvldskey and all of the places where they are serialized (e.g. in pages and in the network protocol). You would also need a much larger page size -- the maximum key+value pair size has to fit into 1/3 of a page.
But if you're willing to make those adjustments and use 64 kB pages, I imagine it would work just fine. Please stay in touch!
Forgive my complete ignorance/inexperience, but why does the page size need to be increased to 64 kB, and why does the key+value length(?) have to fit into 1/3 of a page?
B+Tree leaf nodes contain keys and values. In kvlds (which is the core B+Tree component of kivaloo), the code aims to keep nodes at least 2/3 full, in order to keep the tree approximately balanced. If a key-value pair can take more than 1/3 of a page, you could have a situation where a node is less than 2/3 of a page but adding another key-value pair would take it above the maximum allowed size.
This restriction could be relaxed, e.g. to require only that internal nodes are at least 2/3 full, in which case for the small key / large value case you could have pages which are barely larger than the largest value. I didn't bother doing that since it wasn't relevant to my usage.
I don't think I wrote much documentation about this, sorry. Basically the idea is that there's a pointer which moves through the key-space looking at pages, and if it passes any pages which are "old" it marks them as dirty so that they're rewritten as part of the next batch. The rate at which the cleaning pointer moves through key-space depends on the accumulated "cleaning debt", which is based on the amount of garbage along with the current I/O rate. The aim is to hit a steady state where the total amount of I/O is constant and the cleaning gets the "left over" I/O after requests are serviced.