Sendfile: a system call for web developers to know about

tyingq · on Jan 23, 2016

Related...Apache, nginx, and probably other web servers can be coerced into handling sendfile() for you.

https://tn123.org/mod_xsendfile/ https://www.nginx.com/resources/wiki/start/topics/examples/x...

dullgiulio · on Jan 23, 2016

Also any webserver written in Go uses sendfile transparently when sending mmap-able data down to a socket.

mixmastamyk · on Jan 23, 2016

That nginx config is hard to grok. Is it really that complicated and not just "sendfile = on" ?

dividuum · on Jan 23, 2016

If you have a public url for a file, then you can just point your users to that url and use sendfile=on. Nginx will then use sendfile.

But consider the usecase where you have a bigger download hidden behind some kind of authentication. You'd do that authentication in your normal backend and return the X-Accel-Redirect header in your response. Nginx will then take over and send the data using sendfile.

tyre · on Jan 23, 2016

You are correct! There is a simple sendfile in the http core module.

http://nginx.org/en/docs/http/ngx_http_core_module.html#send...

tyingq · on Jan 24, 2016

I linked to the "harder to grok" piece because that's what would give you programmatic control by setting an X header. That's as opposed to calling sendfile() yourself from your app. Helpful if you're trying to do something more than just serve up static files...like maybe in a page cache, where you also want dynamic control over other headers, etc.

The simpler sendfile=on would just be instructing nginx to use it for static files.

luckydude · on Jan 23, 2016

This is really old but the article mentions splice() and here's the original thinking on that syscall. Linux didn't take all the ideas but they took some:

http://www.mcvoy.com/lm/bitmover/lm/papers/splice.pdf

joveian · on Jan 24, 2016

Or even earlier there is "Exploiting In-Kernel Data Paths to Improve I/O Throughput and CPU Availability" by Fall and Pasquale: https://cseweb.ucsd.edu/~pasquale/Research/Papers/usenix93.p...

amelius · on Jan 23, 2016

Seems quite useless to me. For, what if you want to prefix the file with a header first, or add a trailer? Or what if you want to escape certain parts, or zip the file before sending? Or if you want to send the file incrementally?

It is better (more powerful) to have simple primitives, and work up from them.

If you really want something like sendfile, you could also just find the right library.

luckydude · on Jan 23, 2016

Any well written server application eventually comes down to how fast can you copy the data. This avoids a copy so it can double performance.

As for headers/trailers, it's a file descriptor, nothing prevents you from writing to it before or after the sendfile() call.

Simple primitives are fine but this is faster. Unless you are going to do the page flipping tricks that Irix did, you can't approach this speed (and Irix tricks aren't as fast as this, better than the bcopy() but not as fast as this).

Matthias247 · on Jan 23, 2016

Reality is sometimes a bit different from theory here. Let's say you want to send a small amount of data (1k) from a file to network, some header before it and some trailer behind it. With sendfile you would do three system calls, two normal writes and a sendfile. Without it you would read the file, copy the content, the header and the trailer into a single buffer and issue a single write syscall with it (of course you have the read call before). Due to the lower number of write calls this could be faster then sendfile, as copying a small amount of data might yield a lower overhead then multiple write requests. Of course that situation will change depending on the amount of data - so I think sendfile is one possible optimization that should be benchmarked for the particular application before using it everywhere.

luckydude · on Jan 24, 2016

I'm well aware of this, there is a set of benchmarks in lmbench, which I wrote, that shows something similar to what you are talking about. I called it open to close performance and it plotted the MB/sec as a function of file size for read/write vs mmap.

One would think that mmap is faster because it does less work but that's not true if you include the open/mmap, read is always fast in that case (which is surprising to me, it used to be that there was a cross over point).

But your whole comment sort of misses the point of sendfile. If you are doing little file transfers, use read/write, they are fine. The overhead of the syscalls matter if the size is small. But sendfile is for big data and there the limit, of perf, as the size goes to infinity, is 2x.

paulddraper · on Jan 23, 2016

Three syscalls instead of one will probably not make a difference. Copying a 50mb file multiple times will.

ryanpetrich · on Jan 23, 2016

But they aren't free and if the file size is sufficiently small, the overhead of the constant overhead of the additional syscalls will dwarf the variable overhead of the extra copy. FreeBSD/OS X's sendfile allows specifying a header and trailer in a single syscall to avoid this limitation.

paulddraper · on Jan 24, 2016

I'd be interested in reading how small the file has to be for this to be necessary.

jsolson · on Jan 23, 2016

> Seems quite useless to me. For, what if you want to prefix the file with a header first, or add a trailer? Or what if you want to escape certain parts, or zip the file before sending? Or if you want to send the file incrementally?

The BSD variant supports headers and trailers, and even if it didn't you're under no obligation to not use other system calls to send data on the socket.

If you're running the kind of workload where sendfile() is valuable you would tend to pre-cache variants that you intended to serve. Consider: why escape the file every time I serve it rather than doing it once?

Even if you did some of these transforms on-the-fly, you can still benefit from sendfile() by caching the transformed version of the file in a tmpfs mount and serving future requests for the same variant from the cached version (with sendfile()).

> If you really want something like sendfile, you could also just find the right library.

Due to the nature of BSD sockets you cannot implement sendfile() in a library. The write() system call (and its vectorized equivalents, along with send(), sendmsg(), etc.) all make a copy of the data to be sent into kernel network buffers[0]. This copy from user mode to kernel mode accounts for the bulk of the computational cost of building a static HTTP server, for example, and can become the bottleneck in terms of maxing out the performance of said server. sendfile() is able to elide this copy by taking advantage of kernel-resident filesystem caches. For any given file the first call to sendfile() will bring the file into memory; assuming a sufficient small corpus (or a sufficiently lucky eviction strategy) the bulk of files served can be served directly from kernel memory.

Also, it's certainly lighter weight than putting the entire HTTP server in the kernel (which, as I understand it, is what Microsoft did at some point for IIS).

[0]: With a sufficiently enlightened memory management system in the kernel one could potentially make these zero-copy for data that was, say, mmap()'d. This would offer the potential for sendfile() like performance from a user-mode library, but would require much more delicate plumbing of memory mapped pages through the networking subsystem.

amelius · on Jan 23, 2016

You could also say that the aio interface, which has aio_read() and aio_write() calls, is missing an "aio_transfer()" call.

Just a quick idea, would mmapped files, combined with the aio_write() interface not give us the same functionality as sendfile()?

jsolson · on Jan 23, 2016

Maybe. It depends on whether or not the kernel is smart enough to recognize that the pages you've referenced in the AIO operation are pages are mmap'd files rather than ordinary heap memory. It would need to do some amount of page table walking to learn that, but certainly for a large enough operation that would be cheaper than simply making the copy.

I suspect that Linux, at least, does not do this. That said, I'm curious about the difficulty of plumbing such an optimization. Unfortunately due to other implementation details of the networking stack I work on day to day it wouldn't have any implications for that, but it might have some for one of my side projects.

julik · on Jan 23, 2016

It is in fact very very useful but underutilized, for a very specific purpose - serving very large amounts of data from pretty much any runtime (like Ruby or Python). With those runtimes it works great because you can avoid stuffing the contents of the files you send out into the VM memory, greatly reducing memory and GC pressure. You can also cache the computation product in files, and then send those files out via sendfile() - and the computation can happen on a background thread or in a parallel process.

krakensden · on Jan 23, 2016

If it's not in the kernel, libraries can't wrap it ...