Just to amplify his point, if you want your program to take page faults as PHK suggests, it has to be multithreaded. If you choose event-driven concurrency you can't afford to take page faults in mmap() or read(). When you make the threads vs. events decision you're implicitly making a bunch of related decisions about I/O and scheduling as well; a hybrid approach (like using events and mmap) won't work well.
Exactly: one of the two must be threaded, the swapping subsystem or the side serving clients. Since Redis serves clients in an event-driven fashion, our VM I/O part is threaded. But it was much simpler to design a threaded VM than a fully threaded Redis, and anyway there were other good reasons for implementing VM at the application level.
Not really; because mincore() does not generate events you'd have to poll to find out when the page-in has finished, which seems pretty wasteful. If I were doing disk I/O from an event-driven program I would rather use the Flash approach of having a small number of worker threads perform the blocking I/O. PHK says "I consider this a deficiency in POSIX standards, not in the concept of VM.", but you have to go to production with the kernel you have, not the kernel you wish you had.
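For the curious, the worker-thread pattern looks roughly like this: a few threads pull requests off a queue, do the blocking pread(), and wake the event loop through a pipe it already polls on. Just a sketch, not Flash's (or Varnish's) actual code; the names (io_req, io_submit_req, io_pool_start) are made up and error handling is omitted.

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct io_req {
        int fd;                 /* file to read from */
        off_t off;              /* offset in the file */
        char *buf;              /* caller-supplied buffer */
        size_t len;             /* bytes to read */
        ssize_t result;         /* filled in by the worker */
        struct io_req *next;
    };

    static struct io_req *queue;    /* simple LIFO of pending requests */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;
    static int done_pipe[2];        /* the event loop poll()s done_pipe[0] */

    static void *io_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qlock);
            while (queue == NULL)
                pthread_cond_wait(&qcond, &qlock);
            struct io_req *r = queue;
            queue = r->next;
            pthread_mutex_unlock(&qlock);

            r->result = pread(r->fd, r->buf, r->len, r->off); /* blocking I/O lives here */
            write(done_pipe[1], &r, sizeof r);  /* hand the finished request back */
        }
        return NULL;
    }

    /* Called from the event loop; never blocks on disk. */
    void io_submit_req(struct io_req *r)
    {
        pthread_mutex_lock(&qlock);
        r->next = queue;
        queue = r;
        pthread_mutex_unlock(&qlock);
        pthread_cond_signal(&qcond);
    }

    void io_pool_start(int nthreads)
    {
        pipe(done_pipe);
        for (int i = 0; i < nthreads; i++) {
            pthread_t t;
            pthread_create(&t, NULL, io_worker, NULL);
        }
    }

The event loop just adds done_pipe[0] to its poll set and reads back io_req pointers as completions, so the only thing that ever blocks on disk is the worker pool.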
I agree you have to make it work with the real and not the ideal kernel :) Another thing PHK misses is that the single-threaded approach solves the database concurrency problems for free, and people are starting to use it more - e.g. Stonebraker with his VoltDB.
Polling mincore wouldn't be that bad, since it would only happen between commands, and at least the core of the algorithm would be simple: before executing a command, check with mincore whether its data is in memory; if not, ask to load it with madvise and move on to the next command.
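In code, that check could look something like this (my own illustration, not anything from Redis; it assumes the value lives in an mmap()ed region and skips error handling):

    #define _DEFAULT_SOURCE         /* for mincore()/madvise() on glibc */
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Returns 1 if every page of [addr, addr+len) is resident; otherwise
     * starts an asynchronous page-in and returns 0 so the caller can
     * defer this command and try the next one. */
    int ensure_resident(void *addr, size_t len)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        char *start = (char *)((uintptr_t)addr & ~(uintptr_t)(pagesz - 1));
        size_t span = (size_t)((char *)addr + len - start);
        size_t npages = (span + pagesz - 1) / pagesz;
        unsigned char vec[npages];              /* VLA: fine for a sketch */

        if (mincore(start, span, vec) != 0)
            return 1;                           /* can't tell; just take the fault */

        for (size_t i = 0; i < npages; i++) {
            if (!(vec[i] & 1)) {                /* low bit set = page is resident */
                madvise(start, span, MADV_WILLNEED);  /* kick off the page-in */
                return 0;
            }
        }
        return 1;                               /* safe to touch without blocking */
    }

The command dispatcher would call this, and whenever it returns 0 it parks the client and comes back to it on the next pass, which is the polling part.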
This is exactly how Varnish is set up. Its threads basically only exist for I/O reasons. It takes some serious fu to effectively mix event-driven and threaded programming.
You can use events + mmap; you just need to factor the paging latency into your design. Normally, this might mean mmapping a chunk of data at startup, touching it all so that it's resident in RAM, and then beginning to serve queries, keeping an eye on your total resident set so that it never pages out.
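Something along these lines, say (a sketch under the assumptions above; map_and_warm is a made-up name and error handling is stripped):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map a data file read-only and fault the whole thing in before the
     * event loop starts, so requests never block on disk. */
    char *map_and_warm(const char *path, size_t *lenp)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                              /* the mapping stays valid */

        /* Touch one byte per page now instead of during a request.
         * (On Linux, MAP_POPULATE does the same job at mmap time.) */
        long pagesz = sysconf(_SC_PAGESIZE);
        volatile char sink = 0;
        for (off_t off = 0; off < st.st_size; off += pagesz)
            sink += p[off];

        /* Optionally pin it so the kernel can't page it back out
         * (needs RLIMIT_MEMLOCK headroom or privileges). */
        mlock(p, st.st_size);

        *lenp = st.st_size;
        return p;
    }

The catch, of course, is that this only works while the whole working set actually fits in physical memory, which is exactly the case the Redis VM layer is trying to escape.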
The entire point of Redis's VM layer is to serve data sets larger than RAM when the request distribution is such that not everything needs to be memory resident (I believe this is sometimes referred to as the 1:10 problem in the Redis community).
While RAM is far cheaper than it once was, there are still substantial savings in reducing your resident-set requirements from TBs to just hundreds of GBs.
The point is zero-copy on load, not being able to spill to disk. Most high-performance, scalable servers I've seen ignore virtual memory entirely and kill (+ restart) the process if it exceeds the physical memory available on the machine. Yes, that means they use pre-1960s technology; sometimes, the price of performance is ignoring the programming conveniences we've come up with in the last 50 years.
Most applications need to perform some computations with the data that is read from the backing store; even simple sorts and searches will vastly outweigh the cost of an extra memcpy. Honestly, an HTTP cache is sort of a perfect case for the way Varnish was implemented. There is very little actual processing, if any, that needs to be done with the data that's read from disk; it just needs to be read from the backing store and shuttled over the socket as fast as possible, with very little friction. So an extra memcpy or two really does matter in the Varnish scenario.