tcpcp - passing TCP connections between hosts (2005) (sourceforge.net)
85 points by pabs3 on April 25, 2022 | 25 comments


Was there a recent update to the project? I think the state of the art on this (on Linux) is libsoccr (from CRIU), which uses Linux kernel support for TCP checkpoint/restore. If you're ever doing that for real, you'll need some serious plumbing to make sure your TCP socket doesn't get updated under your nose while you're checkpointing it. Hint: it involves qdiscs, ifb and netlink. Fun project.

I keep hoping one day I'll be able to do the checkpoint operation in one chained io_uring op series :-)
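
For context, the kernel support in question is the TCP_REPAIR socket option, which is what CRIU/libsoccr builds on. Below is a minimal sketch of the dump side, assuming a recent glibc whose <netinet/tcp.h> exposes the TCP_REPAIR constants; real checkpointing also has to freeze traffic and dump the queues, TCP options, window and timestamps, and needs CAP_NET_ADMIN:

    /* Minimal sketch of the TCP_REPAIR "dump" side. Real code (see
     * libsoccr) must also dump send/recv queues, options, window and
     * timestamps. Requires CAP_NET_ADMIN. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>   /* TCP_REPAIR* on recent glibc */
    #include <sys/socket.h>

    static int dump_seqs(int fd, unsigned *snd_seq, unsigned *rcv_seq)
    {
        int on = 1, q;
        socklen_t len = sizeof(*snd_seq);

        /* Enter repair mode: the socket stops talking to its peer and
         * lets us read (and later write) internal state. */
        if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
            return -1;

        q = TCP_SEND_QUEUE;
        if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) < 0 ||
            getsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, snd_seq, &len) < 0)
            return -1;

        q = TCP_RECV_QUEUE;
        if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) < 0 ||
            getsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, rcv_seq, &len) < 0)
            return -1;

        /* Restore on the destination host runs the same options in
         * reverse: new socket, repair mode on, set queue seqs, bind(),
         * connect() (instant in repair mode), refill queues, repair off. */
        return 0;
    }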


The paper mentioned in the documentation section is available in the Wayback Machine: http://web.archive.org/web/20050515132831/http://www.finux.o...


I read the paper on this briefly. I get that you can 'transfer' a socket to another host by sending its state and where you are in the stream. What I don't get is: how exactly does the 'peer' switch over to the new socket? It will keep sending packets to the old IP, and the second machine has its own IP.

If the purpose of using tcpcp is 'load balancing' or whatever, why does it seem to need the connections to sit behind a single machine for routing? I assume that machine is what redirects inbound traffic from the peer to the new host. But if it still requires a single machine for routing, then that machine will be just as congested, and nothing is really load-balanced.

What am I missing here?


IBM was doing something like this in their software load balancer ~1996. The technique was used in the large sporting-event websites of the day (Olympics, Wimbledon, ...)

It went something like this:

- a routing node (a kernel extension) would receive packets and forward them (verbatim, same src & dest IP) to a backend worker (sketched in code below)

- the backend worker was configured with the same IP, which meant either a) ARP had to be disabled to avoid conflicts, or b) the loopback interface had to be aliased to the same IP (using loopback avoids the ARP conflict while the worker is still happy to accept the packet and respond)

- the response then goes directly from the worker back to the original source, which was a good approach for scaling web traffic back then: small HTTP requests went through the routing node, large responses went straight from worker to source.

more details:

https://patents.google.com/patent/EP0838931A2/en
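
To make the "forward verbatim, rewrite only L2" idea concrete, here is a rough, hypothetical sketch of the routing-node half of that scheme. The MAC address and egress interface are made-up values, and per-connection backend selection is omitted:

    /* Hypothetical sketch of the "routing node": forward frames to a
     * backend with src/dst IP untouched; only the Ethernet destination
     * changes. Needs root/CAP_NET_RAW. Backend selection omitted. */
    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <net/if.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
        unsigned char backend_mac[ETH_ALEN] = {0x02,0,0,0,0,0x01};
        struct sockaddr_ll out = {
            .sll_family  = AF_PACKET,
            .sll_ifindex = if_nametoindex("eth1"),  /* assumed egress */
            .sll_halen   = ETH_ALEN,
        };
        unsigned char frame[2048];

        if (fd < 0) { perror("socket"); return 1; }
        memcpy(out.sll_addr, backend_mac, ETH_ALEN);

        for (;;) {
            ssize_t n = recv(fd, frame, sizeof(frame), 0);
            if (n <= ETH_HLEN)
                continue;
            if (memcmp(frame, backend_mac, ETH_ALEN) == 0)
                continue;   /* already forwarded, avoid a loop */
            /* Rewrite only the destination MAC; the worker holds the
             * same IP on lo, accepts the packet, and replies straight
             * to the client (the response never touches this node). */
            memcpy(frame, backend_mac, ETH_ALEN);
            sendto(fd, frame, n, 0, (struct sockaddr *)&out, sizeof(out));
        }
    }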


In their paper, the example is servers sitting behind a switch that respond to ARP requests for the same IP (in addition to their own), which allows packets for that shared IP to reach both servers (two in their example).

Each server accepts every second connection, plus packets for established and related connections. All other packets are dropped.


This might make more sense in a situation where multiple hosts share the same IP, as in a CARP[1] setup. There are probably other useful use cases, but that's the one that comes to my mind.

[1] https://en.wikipedia.org/wiki/Common_Address_Redundancy_Prot...


Anycast?


Assuming you can transfer a connection more efficiently than having the peer reconnect, how do you transfer the L7 application state? What if TLS keys are involved? What if the original host is long gone...


The concept at least seems insanely useful with EC2 spot instances - I've always wondered what the easiest way is to transfer stream connections to another host ahead of an imminent shutdown without the client experiencing any downtime.

It might even be possible to run a service fully on a moderate fleet of ephemeral instances and save over half of the EC2 costs.


What about using a load balancer in front of any two instances?


a) how do you handle pending shutdown of a load balancer

b) sure ok, if it's an http load balancer, you can finish up the existing requests and any new requests will go to new servers; but if it's a tcp load balancer, those don't generally let you swap the server in the middle


Heh, I guess screw them and shut it down anyway - at least that seems to be the policy in my IT department :)


a) There are several load balancers that allow sharing state tables. So if you control the network, you can share/transfer the state tables before removing the LB node from routing. It can even be the same IP with ECMP. I don't know how you would do this on something like an AWS network.

b) In general you're right, but this is already a fairly niche use case. I do wonder whether reprogramming the conntrack entry in Linux would work, or, for LVS (which already has a userspace daemon for state synchronization), how it would behave if you reprogrammed an existing state. It's at least not implausible to do that rewrite somewhere in real time... if you control the edge system.

Also, again in a you-control-the-network scenario, OpenFlow switches might allow you to reprogram the state tables while they're live.


netlink has all you need: https://github.com/tgraf/libnl/blob/master/src/nf-ct-add.c

The Linux kernel stack is crazy.
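
Along the lines of that linked example, a sketch of injecting an ESTABLISHED TCP flow into conntrack with libnl; the addresses, ports and timeout here are illustrative values, not anything from tcpcp:

    /* Sketch in the style of libnl's nf-ct-add.c: inject an ESTABLISHED
     * TCP conntrack entry from userspace over netlink.
     * Build: cc ct_add.c $(pkg-config --cflags --libs libnl-nf-3.0) */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netlink/netlink.h>
    #include <netlink/netfilter/ct.h>
    #include <linux/netfilter/nf_conntrack_tcp.h>

    int main(void)
    {
        struct nl_sock *sk = nl_socket_alloc();
        struct nfnl_ct *ct = nfnl_ct_alloc();
        struct nl_addr *src, *dst;
        int err;

        nl_connect(sk, NETLINK_NETFILTER);

        nfnl_ct_set_family(ct, AF_INET);
        nfnl_ct_set_proto(ct, IPPROTO_TCP);
        nfnl_ct_set_tcp_state(ct, TCP_CONNTRACK_ESTABLISHED);
        nfnl_ct_set_timeout(ct, 3600);

        /* Original direction of the flow (repl = 0). */
        nl_addr_parse("192.0.2.10", AF_INET, &src);
        nl_addr_parse("192.0.2.20", AF_INET, &dst);
        nfnl_ct_set_src(ct, 0, src);
        nfnl_ct_set_dst(ct, 0, dst);
        nfnl_ct_set_src_port(ct, 0, 45678);
        nfnl_ct_set_dst_port(ct, 0, 443);

        err = nfnl_ct_add(sk, ct, NLM_F_CREATE);
        if (err < 0)
            fprintf(stderr, "nfnl_ct_add: %s\n", nl_geterror(err));

        nfnl_ct_put(ct);
        nl_socket_free(sk);
        return err < 0;
    }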


For conntrack, one possibility might be to bypass conntrack for incoming connections using the raw table's NOTRACK target (e.g. `iptables -t raw -A PREROUTING -j NOTRACK`), effectively making the firewall stateless.


This most likely does not work across AZs/zones. Also, how do you transfer the state from the upstream routers?


Out of curiosity, is there something protocol-agnostic that is better than lame-duck mode?


Is there a similar tool that will handle a TLS layer on top? Or is there something like renegotiation that makes it trivial?


I was looking into this some years back when I was considering building a high-availability IRC bouncer that could pass its TLS IRC connections around like this.

There isn't anything out of the box that I could find, but there was some discussion/prototyping around adding an API to mbedtls for exporting all the necessary key material and metadata. With that it would have been "relatively" "easy" to do the TLS bits :)

See https://github.com/Mbed-TLS/mbedtls/issues/3141 and linked ML posts.
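
For reference, Mbed TLS did later grow a context-serialization API (mbedtls_ssl_context_save()/mbedtls_ssl_context_load(), behind MBEDTLS_SSL_CONTEXT_SERIALIZATION), though with significant restrictions on which configurations and connection states it supports. A rough sketch of the shape such a handoff could take; treat it as an illustration of the idea, not a working TCP+TLS migration:

    /* Rough sketch using Mbed TLS context serialization (requires
     * MBEDTLS_SSL_CONTEXT_SERIALIZATION and a supported configuration
     * and connection state). The blob holds live key material, so it
     * must travel over an authenticated, encrypted channel alongside
     * the dumped TCP state. */
    #include <stdlib.h>
    #include <mbedtls/ssl.h>

    /* Source host: serialize the live TLS context. After a successful
     * save the source context is no longer usable, by design. */
    int export_tls_state(mbedtls_ssl_context *ssl,
                         unsigned char **buf, size_t *len)
    {
        size_t need = 0;
        /* First call with no buffer just reports the required size. */
        (void)mbedtls_ssl_context_save(ssl, NULL, 0, &need);
        *buf = malloc(need);
        if (*buf == NULL)
            return -1;
        return mbedtls_ssl_context_save(ssl, *buf, need, len);
    }

    /* Destination host: rebuild a context from the blob; `conf` must
     * match the configuration used on the source host. */
    int import_tls_state(mbedtls_ssl_context *ssl,
                         const mbedtls_ssl_config *conf,
                         const unsigned char *buf, size_t len)
    {
        mbedtls_ssl_init(ssl);
        if (mbedtls_ssl_setup(ssl, conf) != 0)
            return -1;
        return mbedtls_ssl_context_load(ssl, buf, len);
    }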


There are some security concerns with exporting the ephemeral private key material over a network connection. Technically there is nothing that makes this impossible, but from a policy perspective it may be a nonstarter.


If you're OK with transferring TCP state, I don't see why you wouldn't be OK with transferring TLS state, too. You don't even need to transfer the certificate's private key. I've seen some systems where the certificate private key isn't present on most edge nodes; session signing is proxied to centralized nodes and the edge nodes just do the bulk ciphering with the session keys, which necessitates sharing the session keys over a network.


That's why I was thinking more about a key renegotiation, which I know exists in SSH and existed in TLS at one point (at least I know there's a shortcut for it in openssl's s_client).


One of the ideas of tcp checkpoint restore, at least for me, is to avoid a round-trip delay when restoring the connection. Wouldn't a key renegotiation cause at least one round-trip? I'm probably being dense here...


It probably would yes, though that might still be fewer round-trips than establishing a new TLS connection.


You're right.

I'd actually be really interested in a write-up on such ideas: SSH, TLS or WireGuard (voluntary) 'takeover' or checkpoint/restore. I don't remember whether QUIC has multihoming (which might help with checkpoint/restore), but since most of the APIs are in userland and it's UDP, it might be far easier.



