Just to amplify his point, if you want your program to take page faults as PHK suggests, it has to be multithreaded. If you choose event-driven concurrency you can't afford to take page faults in mmap() or read(). When you make the threads vs. events decision you're implicitly making a bunch of related decisions about I/O and scheduling as well; a hybrid approach (like using events and mmap) won't work well.
Exactly: one of the two must be threaded, the swapping subsystem or the side serving clients. Since Redis serves clients in an event-driven fashion, our VM I/O part is threaded. But it was much simpler to design a threaded VM than a fully threaded Redis, and anyway there were other good reasons for implementing VM at the application level.
Not really; because mincore() does not generate events you'd have to poll to find out when the page-in has finished, which seems pretty wasteful. If I were doing disk I/O from an event-driven program I would rather use the Flash approach of having a small number of worker threads perform the blocking I/O. PHK says "I consider this a deficiency in POSIX standards, not in the concept of VM.", but you have to go to production with the kernel you have, not the kernel you wish you had.
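For the curious, the worker-thread pattern looks roughly like this: a few threads pull requests off a queue, do the blocking pread(), and wake the event loop through a pipe it already polls on. Just a sketch, not Flash's (or Varnish's) actual code; the names (io_req, io_submit_req, io_pool_start) are made up and error handling is omitted.

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct io_req {
        int fd;                 /* file to read from */
        off_t off;              /* offset in the file */
        char *buf;              /* caller-supplied buffer */
        size_t len;             /* bytes to read */
        ssize_t result;         /* filled in by the worker */
        struct io_req *next;
    };

    static struct io_req *queue;    /* simple LIFO of pending requests */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;
    static int done_pipe[2];        /* the event loop poll()s done_pipe[0] */

    static void *io_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qlock);
            while (queue == NULL)
                pthread_cond_wait(&qcond, &qlock);
            struct io_req *r = queue;
            queue = r->next;
            pthread_mutex_unlock(&qlock);

            r->result = pread(r->fd, r->buf, r->len, r->off); /* blocking I/O lives here */
            write(done_pipe[1], &r, sizeof r);  /* hand the finished request back */
        }
        return NULL;
    }

    /* Called from the event loop; never blocks on disk. */
    void io_submit_req(struct io_req *r)
    {
        pthread_mutex_lock(&qlock);
        r->next = queue;
        queue = r;
        pthread_mutex_unlock(&qlock);
        pthread_cond_signal(&qcond);
    }

    void io_pool_start(int nthreads)
    {
        pipe(done_pipe);
        for (int i = 0; i < nthreads; i++) {
            pthread_t t;
            pthread_create(&t, NULL, io_worker, NULL);
        }
    }

The event loop just adds done_pipe[0] to its poll set and reads back io_req pointers as completions, so the only thing that ever blocks on disk is the worker pool.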
I agree you have to make it work with the real and not the ideal kernel :) Another thing PHK misses is that the single-threaded approach solves the database concurrency problems for free, and people are starting to use it more - e.g. Stonebraker with his VoltDB.
Polling mincore wouldn't be that bad, since it would only happen between commands, and at least the core of the algorithm would be simple: before executing a command, check with mincore whether its data is in memory; if not, ask to load it with madvise and move on to the next command.
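In code, that check could look something like this (my own illustration, not anything from Redis; it assumes the value lives in an mmap()ed region and skips error handling):

    #define _DEFAULT_SOURCE         /* for mincore()/madvise() on glibc */
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Returns 1 if every page of [addr, addr+len) is resident; otherwise
     * starts an asynchronous page-in and returns 0 so the caller can
     * defer this command and try the next one. */
    int ensure_resident(void *addr, size_t len)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        char *start = (char *)((uintptr_t)addr & ~(uintptr_t)(pagesz - 1));
        size_t span = (size_t)((char *)addr + len - start);
        size_t npages = (span + pagesz - 1) / pagesz;
        unsigned char vec[npages];              /* VLA: fine for a sketch */

        if (mincore(start, span, vec) != 0)
            return 1;                           /* can't tell; just take the fault */

        for (size_t i = 0; i < npages; i++) {
            if (!(vec[i] & 1)) {                /* low bit set = page is resident */
                madvise(start, span, MADV_WILLNEED);  /* kick off the page-in */
                return 0;
            }
        }
        return 1;                               /* safe to touch without blocking */
    }

The command dispatcher would call this, and whenever it returns 0 it parks the client and comes back to it on the next pass, which is the polling part.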
This is exactly how Varnish is set up. Its threads basically only exist for I/O reasons. It takes some serious fu to effectively mix event-driven and threaded programming.
You can use events + mmap; you just need to factor the paging latency into your design. Normally, this might mean mmapping a chunk of data at startup, touching it all so that it's resident in RAM, and then beginning to serve queries, keeping an eye on your total resident set so that it never pages out.
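Something along these lines, say (a sketch under the assumptions above; map_and_warm is a made-up name and error handling is stripped):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map a data file read-only and fault the whole thing in before the
     * event loop starts, so requests never block on disk. */
    char *map_and_warm(const char *path, size_t *lenp)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                              /* the mapping stays valid */

        /* Touch one byte per page now instead of during a request.
         * (On Linux, MAP_POPULATE does the same job at mmap time.) */
        long pagesz = sysconf(_SC_PAGESIZE);
        volatile char sink = 0;
        for (off_t off = 0; off < st.st_size; off += pagesz)
            sink += p[off];

        /* Optionally pin it so the kernel can't page it back out
         * (needs RLIMIT_MEMLOCK headroom or privileges). */
        mlock(p, st.st_size);

        *lenp = st.st_size;
        return p;
    }

The catch, of course, is that this only works while the whole working set actually fits in physical memory, which is exactly the case the Redis VM layer is trying to escape.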
The entire point of Redis's VM layer is to serve data sets larger than RAM when the request distribution is such that not everything needs to be memory resident (I believe this is sometimes referred to as the 1:10 problem in the Redis community).
While RAM is far cheaper than it once was, there are still substantial savings in reducing your resident-set requirements from TBs to just hundreds of GBs.
The point is zero-copy on load, not being able to spill to disk. Most high-performance, scalable servers I've seen ignore virtual memory entirely and kill (+ restart) the process if it exceeds the physical memory available on the machine. Yes, that means they use pre-1960s technology; sometimes, the price of performance is ignoring the programming conveniences we've come up with in the last 50 years.
Most applications need to perform some computations with the data that is read from the backing store; even simple sorts and searches will vastly outweigh the cost of an extra memcpy. Honestly, an HTTP cache is sort of a perfect case for the way Varnish was implemented. There is very little actual processing, if any, that needs to be done with the data that's read from disk; it just needs to be read from the backing store and shuttled over the socket as fast as possible, with very little friction. So an extra memcpy or two really does matter in the Varnish scenario.