tcpcp - passing TCP connections between hosts (2005) (sourceforge.net)
85 points by pabs3 on April 25, 2022 | 25 comments


Was there a recent update to the project? I think the state of the art on this (on Linux) is libsoccr (from CRIU), which uses Linux kernel support for TCP checkpoint/restore. If you're ever doing that for real, you'll need some serious plumbing to make sure your TCP socket doesn't get updated under your nose while you're checkpointing it. Hint: it involves qdiscs, ifb and netlink. Fun project.

I keep hoping one day I'll be able to do the checkpoint operation in one chained io_uring op series :-)
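
For context, the kernel support in question is the TCP_REPAIR socket option, which is what CRIU/libsoccr builds on. Below is a minimal sketch of the dump side, assuming a recent glibc whose <netinet/tcp.h> exposes the TCP_REPAIR constants; real checkpointing also has to freeze traffic and dump the queues, TCP options, window and timestamps, and needs CAP_NET_ADMIN:

    /* Minimal sketch of the TCP_REPAIR "dump" side. Real code (see
     * libsoccr) must also dump send/recv queues, options, window and
     * timestamps. Requires CAP_NET_ADMIN. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>   /* TCP_REPAIR* on recent glibc */
    #include <sys/socket.h>

    static int dump_seqs(int fd, unsigned *snd_seq, unsigned *rcv_seq)
    {
        int on = 1, q;
        socklen_t len = sizeof(*snd_seq);

        /* Enter repair mode: the socket stops talking to its peer and
         * lets us read (and later write) internal state. */
        if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
            return -1;

        q = TCP_SEND_QUEUE;
        if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) < 0 ||
            getsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, snd_seq, &len) < 0)
            return -1;

        q = TCP_RECV_QUEUE;
        if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) < 0 ||
            getsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, rcv_seq, &len) < 0)
            return -1;

        /* Restore on the destination host runs the same options in
         * reverse: new socket, repair mode on, set queue seqs, bind(),
         * connect() (instant in repair mode), refill queues, repair off. */
        return 0;
    }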


The paper mentioned in the documentation section is available in the Wayback Machine: http://web.archive.org/web/20050515132831/http://www.finux.o...


I read the paper on this briefly. I get that you can 'transfer' a socket to another host by sending its state and where you are in the stream. What I don't get is: how exactly does the 'peer' switch over to the new socket? It will keep sending packets to the old IP, and the second machine has its own IP.

If the purpose of using tcpcp is 'load balancing' or whatever, why does it seem to need the connections to sit behind a single machine for routing? I assume that machine is what redirects inbound traffic from the peer to the new host. But if it still requires a single machine for routing, then that machine will be just as congested, and nothing is really load-balanced.

What am I missing here?


IBM was doing something like this in their software load balancer ~1996. The technique was used in the large sporting-event websites of the day (Olympics, Wimbledon, ...)

It went something like this:

- a routing node (a kernel extension) would receive packets and forward them (verbatim, same src & dest IP) to a backend worker (sketched in code below)

- the backend worker was configured with the same IP, which meant either a) ARP had to be disabled to avoid conflicts, or b) the loopback interface had to be aliased to the same IP (using loopback avoids the ARP conflict while the worker is still happy to accept the packet and respond)

- the response then goes directly from the worker back to the original source, which was a good approach for scaling web traffic back then: small HTTP requests went through the routing node, large responses went straight from worker to source.

more details:

https://patents.google.com/patent/EP0838931A2/en
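
To make the "forward verbatim, rewrite only L2" idea concrete, here is a rough, hypothetical sketch of the routing-node half of that scheme. The MAC address and egress interface are made-up values, and per-connection backend selection is omitted:

    /* Hypothetical sketch of the "routing node": forward frames to a
     * backend with src/dst IP untouched; only the Ethernet destination
     * changes. Needs root/CAP_NET_RAW. Backend selection omitted. */
    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <net/if.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
        unsigned char backend_mac[ETH_ALEN] = {0x02,0,0,0,0,0x01};
        struct sockaddr_ll out = {
            .sll_family  = AF_PACKET,
            .sll_ifindex = if_nametoindex("eth1"),  /* assumed egress */
            .sll_halen   = ETH_ALEN,
        };
        unsigned char frame[2048];

        if (fd < 0) { perror("socket"); return 1; }
        memcpy(out.sll_addr, backend_mac, ETH_ALEN);

        for (;;) {
            ssize_t n = recv(fd, frame, sizeof(frame), 0);
            if (n <= ETH_HLEN)
                continue;
            if (memcmp(frame, backend_mac, ETH_ALEN) == 0)
                continue;   /* already forwarded, avoid a loop */
            /* Rewrite only the destination MAC; the worker holds the
             * same IP on lo, accepts the packet, and replies straight
             * to the client (the response never touches this node). */
            memcpy(frame, backend_mac, ETH_ALEN);
            sendto(fd, frame, n, 0, (struct sockaddr *)&out, sizeof(out));
        }
    }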


In their paper, the example is servers sitting behind a switch that respond to ARP requests for the same IP (in addition to their own), which allows packets for that shared IP to reach both servers (two in their example).

Each server accepts every second connection, plus packets for established and related connections. All other packets are dropped.


This might make more sense in a situation where multiple hosts share the same IP, as in a CARP[1] setup. There are probably other useful use cases, but that's the one that comes to my mind.

[1] https://en.wikipedia.org/wiki/Common_Address_Redundancy_Prot...


Anycast?


Assuming you can transfer a connection more efficiently than having the peer reconnect, how do you transfer the L7 application state? What if TLS keys are involved? What if the original host is long gone...


The concept at least seems insanely useful with EC2 spot instances - I've always wondered what the easiest way is to transfer stream connections to another host ahead of an imminent shutdown without the client experiencing any downtime.

It might even be possible to run a service fully on a moderate fleet of ephemeral instances and save over half of the EC2 costs.


What about using a load balancer in front of any two instances?


a) how do you handle pending shutdown of a load balancer

b) sure ok, if it's an http load balancer, you can finish up the existing requests and any new requests will go to new servers; but if it's a tcp load balancer, those don't generally let you swap the server in the middle


Heh, I guess screw them and shut it down anyway - at least that seems to be the policy in my IT department :)


a) There are several load balancers that allow sharing state tables. So if you control the network, you can share/transfer the state tables before removing the LB node from routing. It can even be the same IP with ECMP. I don't know how you would do this on something like an AWS network.

b) In general you're right, but this is already a fairly niche use case. I do wonder whether reprogramming the conntrack entry in Linux would work, or, for LVS (which already has a userspace daemon for state synchronization), how it would behave if you reprogrammed an existing state. It's at least not implausible to do that rewrite somewhere in real time... if you control the edge system.

Also, again in a you-control-the-network scenario, OpenFlow switches might allow you to reprogram the state tables while they're live.


netlink has all you need: https://github.com/tgraf/libnl/blob/master/src/nf-ct-add.c

The Linux kernel stack is crazy.
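
Along the lines of that linked example, a sketch of injecting an ESTABLISHED TCP flow into conntrack with libnl; the addresses, ports and timeout here are illustrative values, not anything from tcpcp:

    /* Sketch in the style of libnl's nf-ct-add.c: inject an ESTABLISHED
     * TCP conntrack entry from userspace over netlink.
     * Build: cc ct_add.c $(pkg-config --cflags --libs libnl-nf-3.0) */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netlink/netlink.h>
    #include <netlink/netfilter/ct.h>
    #include <linux/netfilter/nf_conntrack_tcp.h>

    int main(void)
    {
        struct nl_sock *sk = nl_socket_alloc();
        struct nfnl_ct *ct = nfnl_ct_alloc();
        struct nl_addr *src, *dst;
        int err;

        nl_connect(sk, NETLINK_NETFILTER);

        nfnl_ct_set_family(ct, AF_INET);
        nfnl_ct_set_proto(ct, IPPROTO_TCP);
        nfnl_ct_set_tcp_state(ct, TCP_CONNTRACK_ESTABLISHED);
        nfnl_ct_set_timeout(ct, 3600);

        /* Original direction of the flow (repl = 0). */
        nl_addr_parse("192.0.2.10", AF_INET, &src);
        nl_addr_parse("192.0.2.20", AF_INET, &dst);
        nfnl_ct_set_src(ct, 0, src);
        nfnl_ct_set_dst(ct, 0, dst);
        nfnl_ct_set_src_port(ct, 0, 45678);
        nfnl_ct_set_dst_port(ct, 0, 443);

        err = nfnl_ct_add(sk, ct, NLM_F_CREATE);
        if (err < 0)
            fprintf(stderr, "nfnl_ct_add: %s\n", nl_geterror(err));

        nfnl_ct_put(ct);
        nl_socket_free(sk);
        return err < 0;
    }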


For conntrack, one possibility might be to bypass conntrack for incoming connections using the raw table's NOTRACK target (e.g. `iptables -t raw -A PREROUTING -j NOTRACK`), effectively making the firewall stateless.


This most likely does not work across AZs/zones. Also, how do you transfer the state from the upstream routers?


Out of curiosity, is there something protocol-agnostic that is better than lame-duck mode?


Is there a similar tool that will handle a TLS layer on top? Or is there something like renegotiation that makes it trivial?


I was looking into this some years back when I was considering building a high-availability IRC bouncer that could pass its TLS IRC connections around like this.

There isn't anything out of the box that I could find, but there was some discussion/prototyping around adding an API to mbedtls for exporting all the necessary key material and metadata. With that it would have been "relatively" "easy" to do the TLS bits :)

See https://github.com/Mbed-TLS/mbedtls/issues/3141 and linked ML posts.
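
For reference, Mbed TLS did later grow a context-serialization API (mbedtls_ssl_context_save()/mbedtls_ssl_context_load(), behind MBEDTLS_SSL_CONTEXT_SERIALIZATION), though with significant restrictions on which configurations and connection states it supports. A rough sketch of the shape such a handoff could take; treat it as an illustration of the idea, not a working TCP+TLS migration:

    /* Rough sketch using Mbed TLS context serialization (requires
     * MBEDTLS_SSL_CONTEXT_SERIALIZATION and a supported configuration
     * and connection state). The blob holds live key material, so it
     * must travel over an authenticated, encrypted channel alongside
     * the dumped TCP state. */
    #include <stdlib.h>
    #include <mbedtls/ssl.h>

    /* Source host: serialize the live TLS context. After a successful
     * save the source context is no longer usable, by design. */
    int export_tls_state(mbedtls_ssl_context *ssl,
                         unsigned char **buf, size_t *len)
    {
        size_t need = 0;
        /* First call with no buffer just reports the required size. */
        (void)mbedtls_ssl_context_save(ssl, NULL, 0, &need);
        *buf = malloc(need);
        if (*buf == NULL)
            return -1;
        return mbedtls_ssl_context_save(ssl, *buf, need, len);
    }

    /* Destination host: rebuild a context from the blob; `conf` must
     * match the configuration used on the source host. */
    int import_tls_state(mbedtls_ssl_context *ssl,
                         const mbedtls_ssl_config *conf,
                         const unsigned char *buf, size_t len)
    {
        mbedtls_ssl_init(ssl);
        if (mbedtls_ssl_setup(ssl, conf) != 0)
            return -1;
        return mbedtls_ssl_context_load(ssl, buf, len);
    }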


There are some security concerns with exporting the ephemeral private key material over a network connection. Technically there is nothing that makes this impossible, but from a policy perspective it may be a nonstarter.


If you're OK with transferring TCP state, I don't see why you wouldn't be OK with transferring TLS state, too. You don't even need to transfer the certificate's private key. I've seen some systems where the certificate private key isn't present on most edge nodes; session signing is proxied to centralized nodes and the edge nodes just do the bulk ciphering with the session keys, which necessitates sharing the session keys over a network.


That's why I was thinking more about a key renegotiation, which I know exists in SSH and existed in TLS at one point (at least I know there's a shortcut for it in openssl's s_client).


One of the ideas of tcp checkpoint restore, at least for me, is to avoid a round-trip delay when restoring the connection. Wouldn't a key renegotiation cause at least one round-trip? I'm probably being dense here...


It probably would yes, though that might still be fewer round-trips than establishing a new TLS connection.


You're right.

I'd actually be really interested in a write-up on such ideas: SSH, TLS or WireGuard (voluntary) 'takeover' or checkpoint/restore. I don't remember whether QUIC has multihoming (which might help with checkpoint/restore), but since most of the APIs are in userland and it's UDP, it might be far easier.



