Note: The performance values mentioned on that page (on an EC2 c1.medium instance using spinning-rust disks!) are wildly out of date. I'll get around to updating them some day.
For reference, on my laptop (Dell Latitude 7390 with an i7-8650U CPU):
* Bulk inserts run at ~600,000/second (up from 125,000).
* Bulk extracts run at ~660,000/second while in RAM (up from 30,000) and ~220,000/second from disk (up from 20,000).
* Bulk updates run at ~700,000/second in RAM dropping to ~300,000 from disk (up from 110,000 dropping to 60,000).
* Random reads run at ~800,000/second in RAM dropping to ~20,000 from disk (up from 220,000 dropping to 11,000).
* Random mixed run at ~400,000/second in RAM dropping to ~10,000 from disk (up from 30,000 dropping to 4,000).
* Hot-spot reads run at ~800,000/second in RAM dropping to ~500,000 from disk (up from 220,000 dropping to 60,000).
Note that "in RAM" means "the dataset fits into RAM" -- in all cases data is durably stored to disk.
There is almost certainly room for improvement; I haven't done extensive profiling yet.
> It was designed to satisfy the needs of the Tarsnap online backup service for high-performance key-value storage, although it is not yet being used for that purpose
Does the fact that you're still maintaining Kivaloo this many years later imply it's now being used by Tarsnap?
Tarsnap has recently started using Kivaloo. Over time I intend to use it far more -- but since Tarsnap is a backup service, I'm starting with the least critical parts first.
Out of curiosity, what was the use case/upstream requirement for building Kivaloo in the first place?
The Kivaloo page notes that it was designed for Tarsnap, but I'm curious as to the "why" - eg, an eventual-consistency/consensus building block, a simple local metadata service, server-side housekeeping...?
(I'm also curious what "{indexed ,}sorted files" means, and how UFS is significant.)
1. What are, off the top of your head, some design changes or code changes required that'd bring drastic performance improvements?
2. What are some key internals that you think differentiate Kivaloo from other embedded KV stores? I assume you must have gone through a lot of existing literature on the topic before building this. For example, LMDB, BDB, RocksDB, LevelDB, SQLite and the likes come to mind that can double up as KV stores.
3. Does it store the database in flat files with a WAL in front? Is the file format of the database custom, or based on existing formats?
4. Does the database auto index the fields? Or, use any other such aids to speed up access to data?
No, I mean with a dataset which is too large to fit into the amount of RAM on the system.
> 1. What are, off the top of your head, some design changes or code changes required that'd bring drastic performance improvements?
Nothing immediately comes to mind. Profiling may reveal some improvements, of course.
> 2. What are some key internals that you think differentiate Kivaloo from other embedded KV stores? I assume you must have gone through a lot of existing literature on the topic before building this. For example, LMDB, BDB, RocksDB, LevelDB, SQLite and the likes come to mind that can double up as KV stores.
Well... kivaloo isn't an embedded KV store, so that would be a big differentiating factor. It's a network daemon.
> 3. Does it store the database in flat files with a WAL in front? Is the file format of the database custom, or based on existing formats?
The "on-disk" format is the pages of an append-only B+Tree, with the last page being the tree root.
I put "on-disk" in scare quotes because there are other backends, e.g. using Amazon DynamoDB to store pages.
4. Does the database auto index the fields? Or, use any other such aids to speed up access to data?
There are no fields. Key-value pairs, nothing more.
You are correct, as things are at the present time. I designed kivaloo to be composable with the intention that I could put a replication/sharding layer in front of it later, however.
I am genuinely intrigued by how much it costs to commission those tests. They are so complete and thorough that I can't see myself being able to afford one for some project of mine, but I would absolutely like to have an estimate.
These are actually some really good numbers for a certain application I'm looking at (https://hackertimes.com/item?id=24191307), especially with bulk inserts being so high.
A few questions:
- Why 255 bytes to 255 bytes? Does this have any performance consequences?
- Your laptop is SSD right?
- Is there any more documentation on how to use Kivaloo? (not just facts about it)
The 255 byte limit is because I store keys and values as a one-byte length followed by the relevant data. For Tarsnap what I typically want is ~40 byte keys and data (hence using those sizes for benchmarking).
Yes, my laptop has an Intel 660p 512 GB NVMe disk.
No documentation per se, although I hope the library interfaces are reasonably understandable. I'd be happy to help though -- this code deserves to be used!
Is the one-byte length central to some logic (or some performance concerns), or could one hack the `uint8_t` to a `uint16_t` in kvldskey and so on to get something that could store a bit more? We have some systems that need only 16-32 byte keys but upwards of around 16k values, we're currently using RocksDB but these performance numbers + no compaction process + mux approach are making me interested...
Interesting question. You would need to adjust kvldskey and all of the places where they are serialized (e.g. in pages and in the network protocol). You would also need a much larger page size -- the maximum key+value pair size has to fit into 1/3 of a page.
But if you're willing to make those adjustments and use 64 kB pages, I imagine it would work just fine. Please stay in touch!
Forgive my complete ignorance/inexperience, but why does the page size need to be increased to 64 kB, and why does the key+value length(?) have to fit into 1/3 of a page?
B+Tree leaf nodes contain keys and values. In kvlds (which is the core B+Tree component of kivaloo), the code aims to keep nodes at least 2/3 full, in order to keep the tree approximately balanced. If a key-value pair can take more than 1/3 of a page, you could have a situation where a node is less than 2/3 of a page but adding another key-value pair would take it above the maximum allowed size.
This restriction could be relaxed, e.g. to require only that internal nodes are at least 2/3 full, in which case for the small key / large value case you could have pages which are barely larger than the largest value. I didn't bother doing that since it wasn't relevant to my usage.
I don't think I wrote much documentation about this, sorry. Basically the idea is that there's a pointer which moves through the key-space looking at pages, and if it passes any pages which are "old" it marks them as dirty so that they're rewritten as part of the next batch. The rate at which the cleaning pointer moves through key-space depends on the accumulated "cleaning debt", which is based on the amount of garbage along with the current I/O rate. The aim is to hit a steady state where the total amount of I/O is constant and the cleaning gets the "left over" I/O after requests are serviced.