Swarm consensus [pdf]

judofyr · on Dec 28, 2021

I read through the paper when the first draft was published on Arxiv a few weeks ago. It's a very neat algorithm and there seems to be a cool underlying concept here, but I wasn't quite able to make it practical.

This algorithm requires a lot of messages. E.g. if you have four nodes positioned in a line and an action is starting at the end, it will need in total 15 messages (A sends 4 messages to B; B sends 3 messages to both A+C; C sends 2 messages to both B+D; D sends one message to C). And in general what happens is that as the action is being propagated to the boundary of the graph, the current nodes keep "heart beating" to each other: They continue sending messages to each other which doesn't contribute to any specific exchange of information (other than staying in sync).

In addition this is heavily round-based (the nodes only progress once they've received responses from all neighbors) which means that one slow node ("straggler") can stall everything.

I'm super interested if it's possible to expand on the ideas in the paper though!

gritzko · on Dec 28, 2021

Actually, I may address it here.

> This algorithm requires a lot of messages

The algorithm requires (1) data propagation (proposal) messages and (2) heartbeat messages that confirm progress similar to TCP ACKs. The amount of heartbeat messages is not dependent on the presence/absence/number of proposals. Both types of messages are more or less unavoidable.

> one slow node ("straggler") can stall everything

In fully asynchronous systems, consensus is impossible anyway. Hence, there is always some turn duration/timeout value.

DarylZero · on Dec 28, 2021

Have you looked at Stellar at all?

https://www.stellar.org/papers/stellar-consensus-protocol

https://medium.com/interstellar/understanding-the-stellar-co...

gritzko · on Dec 28, 2021

Thanks, we will address that.

saurik · on Dec 28, 2021

At least at a high level, this sounds a lot like Avalanche? Each node tells their neighbors about something and eventually a phase transition is passed that makes it super difficult for anyone to convince anyone something different happened? (Which itself claims to be revolutionary but has always felt like an example of an Ising model to me, which others have used for consensus.)

My biggest concern, at a lower level, is the mechanism used to try to get away from this algorithm having "turns"--which the real world does not have as there is no shared clock due to distance effects (which is made all the worse due to the Internet not satisfying triangle inequality)--seems to rely on a fixed diameter and some timing assumptions that squick me (as someone who always wants to find asynchronous solutions).

gritzko · on Dec 28, 2021

On the formal side, SC is deterministic and nodes only talk to their friends, which is the opposite of Avalanche. The article has some calculations for the case of real-world time, but in the end it is always a question of acceptable timeouts. In a fully asynchronous system, deterministic consensus is impossible. What does it mean practically is a different story though.

Regarding distance effects, if node clocks are well synchronized (like, NTP well, not TrueTime well) and the time step is long enough, then the heartbeat should work fine, I believe.

DarylZero · on Dec 28, 2021

How does this compare to Stellar?

https://www.stellar.org/papers/stellar-consensus-protocol

https://medium.com/interstellar/understanding-the-stellar-co...

gritzko · on Dec 28, 2021

If I understand correctly, Stellar assumes there is some core of the network where quorum slices intersect?

noxer · on Dec 28, 2021

Don't know about Stellar but the original XRPL consensus which Stellar is based on requires each validator "node" (nodes that participate in consensus finding) to define which other nodes it listens to (This list of other nodes doesn't have to be public.) But if you would make all the lists public the most overlapped nodes would be the core. A common misconception is however that nodes in this core somehow have more rights or can somehow affect the outcome of the consensus. Which is not the case.

Each node enforces all rules. The overlap is only requires in the sense of that it prevents forking. I.e. if there would be 2 loosely connected "cores" the same size and they get disconnected form each other, both sides could keep going which is objectively worse than if the system halts until the problem is resolved. This "emergency halting" is only guaranteed if there is one core that is overlapped enough so that no 2 separated parts or the network could reach the 80% quorum.

DarylZero · on Dec 28, 2021

As I understand it, the protocol doesn't assume that.

It's just an assumption of the payment processing network built on top, that there will be some "strong set" or whatever of nodes that includes big banking institutions and that actual human beings will recognize as canonical.

api · on Dec 28, 2021

Neat paper, but: "Gnomes know each other, to start with, so the problem itself is non-existant, not to mention the price of the solution. Similarly, the Byzantine generals problem was found to be irrelevant at first. Gnomes know who their friends are."

That makes it useless for real world coordination outside of systems run by closed vertically integrated organizations like companies, governments, etc. In the real world you don't know who your friends are. That's the point of Nakamoto consensus, which is AFAIK the only known partial real world solution to this problem. Unfortunately it's grotesquely inefficient and still has failure modes when you assume "irrational" actors willing to burn resources just to destroy the system. That's why I called it a partial solution. I still wonder if an efficient general solution with no subjectivity and no assumption of a sentient or AI operator who can intervene is possible.

This could potentially be better than Raft and friends for geographically distributed consensus or extreme scalability. Raft based systems ultimately end up being limited by the ability of the leader to lead a sufficiently large cluster. In practice with modern server-grade hardware this limit is quite high but it still exists.

noxer · on Dec 28, 2021

FBA does not require to know who your friends are you just need to know enough nodes. If some of the nodes are not your friend aka they lie to you, you would know as soon as they lie and you can simply stop listen to them.

This however requires identities. You cant just listen to anyone because someone could simply fire up 1000 lying nodes and if you pick 100 at random you'd have mostly lying nodes. You would figure out if they lie but then you have to stop listen to them all and you are offline. In practice this is not a real problem because obvious you would pick old nodes and not the 1000 freshly generated "identities". The assumption that you know nothing about any other node is a more or less made up thing for theoretical analysis that has no relevance in the real world. You dont start a network form zero you just want to join an existing one and you know its there so you might as well know who is there.

api · on Dec 28, 2021

Ahh, okay, but requiring identities means you must have a central authority to issue them or they must be extremely costly (proof of work or similar).

Of course this doesn't make this useless. If identities are all you need then you've devolved the role of the central authority to nothing more than something that says who you are. If there is no provision for centralized identity revocation then you don't have a censorship problem because once you have an identity you have it forever. Identities could also be anonymized keys where only the central authority knows the true identity, minimizing privacy exposure surface area.

Decentralization isn't boolean. This is more decentralized than Raft.

Edit: I suppose proof of work for identities could be less costly than continuous PoW. The work required just has to be high enough to make sybil attacks expensive enough to meet your threat model threshold.

noxer · on Dec 28, 2021

>identities means you must have a central authority to issue them

These are cryptographic identities. You create one by publishing the public key of your node.

There is no authority needed. You decide if you believable a given identities to be who they calm to be. Lets say HN would run a node they would simply publish the public key on their website and if it has been there for weeks you can reasonably assume the page wasn't hacked and the key actually belong to the operators of HN. (The key can also be signed with the domain cert)

It doesn't really matter that 100 other nodes could claim to be operated by NH too. There is no need for a central authority because every user can easily figure out which one is the real key. People can curate lists of nodes they "trust" and publish thous to make it easier for new people. Since everyone can publish such a list the curator of such a list is also no authority.

The obvious downside of this is ofc that only existing "names" will be recognized. If you just say here is my key and I have a cool website then most people would not add you for a long time. (You can still participate in the network.) But this is also wanted because everyone should only add nodes that are reliable, have the proper hardware and maintenance and all that. Basically being part of the community helps to exclude someone who just make a bunch of identities that look like they belong to different people. Even if no one adds your node to the list you can still participate by relaying info form other nodes since the data is signed by the origin node someone who gets the data from you can still use that data even if it does not use your nodes data.

There are a few ledgers running on these principles today most notably the XRPL[1] since 8+ years.

[1]https://xrpl.org/intro-to-consensus.html

withinboredom · on Dec 28, 2021

Doesn’t this mean you have to trust whomever made the list? Let’s say I have a pretty large following (I don’t) and I created my own trust list after accusing the authors of the “trusted list” or social engineering a situation that causes people to lose trust. Then I could take over the network with my “trusted” list. I’m sure people would figure out what’s going on, but now I’ve destroyed all trust in a trusted list and destroyed the network.

noxer · on Jan 8, 2022

You make the list for your node. There are published lists but they are only a recommendation by whoever published the list. You may use one or many lists and/or add/remove nodes how you see fit. So you dont need to trust them you should trust your own list. However its importation to understand that if you happen to add a few non-honest nodes they cant cause any real damage. Since your node checks all the rules and signatures itself it can not be tricked into accepting malicious transactions + once such a node would "lie" it would be publicly visible for anyone. The "lie" would be signed by that nodes identity. So everyone knows whom to remove from the list. Meaning every non-honest node has to act honest forever it will be forever "burned" because its known "evil" once it didn't act honest.

>I created my own trust list It would exist alongside every other already published list. If your list doesn't have a huge overlap with other existing lists its de-facto irrelevant because that would mean your list only contains unknown nodes and no one would use it.

>causes people to lose trust They dont have to trust the list publisher, damaging the list publishes reputation is somewhat pointless the list itself or better the nodes on there stay valid. Lets say there are nodes on the list that run since 8 years have low downtime and never once tired to somehow harm the network. That "trust" doesn't go away. In a sense time or age is a trust indicator that can not be forged. You would need to reasonably proof that the nodes identity actually got transferred to someone else who wants to use it to harm the network. And you would need to do this for every node on the list that you want people to remove.

>Then I could take over the network with my “trusted” list.

If your list contains mostly nodes that you operate/control. And a majority of the other nodes actually would use your list and your list alone (not combine it with the existing lists) then maybe you could reach the 80% quorum which could be considered "taking over". You could for example vote in an amendment to change the code the nodes run. But for that you have to publish the code first and then the amendment would need 2 weeks 80% votes to become active. Whatever is in that code if its bad and 80% vote for it anyway it would be obvious that the yes voting nodes are compromised and all the honest nodes, again, simply have to remove them.

Realistically no one would even install the software update if its not going through the official GitHub where the involved devs would want time for audits and all that before someone installs it. Also would most likely be installed on the test net which is centralized so it can be enabled fast and everyone can test it.

So the most damage someone could do if he successfully compromise lots of validator nodes is that he could stop the network possibly for several days because people have to remove the nodes manually to reach consensus again. It would of course be very bad but at the same time its incredibly unlikely to happen and there is not much incentive to even try. Its not like you could print money if you stop the network.

api · on Dec 29, 2021

If that's the case and there is no cost, I still don't get how this is resistant to sophisticated and well-resourced sybil attacks. What's to stop someone from just firing up 10,000 nodes in the cloud?

noxer · on Jan 8, 2022

No one listens to these 10k newly created nodes. If they all add each other to listen to each other they essentially create a fork as soon as they divert somehow from the rest of the network and no one would care. The attacker then has full control over the fork but again no one listens to it so its pointless.

You would need to convince a majority of nodes to listen to your nodes and stop listen to the nodes who dont listen to you. There is zero reason why anyone would do this. Even if you could bribe some to do it, it would likely never be close to 80% and it would be incredible hard to hide the attempt.

The number of nodes for each node to list to would need to drastically increase. If we do the math with 10 nodes to gain 80% control you would need to add 40 nodes so there is a total of 50 nodes and you control 80% of them. If you are one of the 10 nodes its obvious you would never add 40 new nodes in a short time. There is no benefit form adding many nodes anyway the goal is to have the most reliable and de-central nodes list possible not as many as possible.

https://xrpl.org/consensus-protections.html#sybil-attacks

isaacimagine · on Dec 29, 2021

10,000 nodes are only as powerful as the amount of trust the rest of the network lends to them. If nobody trusts those nodes, they will have no impact. If the nodes all trust each other and a single outside node trusts one of them, that node is worth 1 vote, not 10,000.

Under the Stellar Consensus Protocol, the people you trust is your 'quorum slice.' If you take a transitive closure of trust over all slices, you'll arrive at the set of nodes you rely on. The most trusted nodes will appear in a lot of quorum slices, and essentially form a core trusted quorum. An algorithm similar to PAXOS is run by all nodes, but the results in the core quorum will override outer nodes until consensus is reached.

Thus as long as no malignant nodes are in the core quorum, they will not impact consensus. If a node confirms something contrary to what everyone else confirms, you know it is malignant. And in the rare case that a malignant node becomes highly trusted (which would mean it isn't acting malignantly), then confirms contrary, Stellar prioritized safety and will come to a halt until trust graphs are readjusted to remove the bad actor (which removal can be done automatically if it's provably acting malignantly, i.e. signing invalid messages.)

erpellan · on Dec 28, 2021

Interesting! How does this compare to Serf / SWIM?

https://www.serf.io/docs/internals/gossip.html

https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/...

gritzko · on Dec 28, 2021

In a fundamental way: a node is not tracking all other nodes. The number of nodes may be unknown. A node only tracks its adjacent nodes.

isaacimagine · on Dec 28, 2021

What are the use cases of swarm consensus? Is it byzantine fault tolerant? (I think so?) If so, what component of the CAP theorem does it sacrifice? It it suitable for use in federated/distributed contexts (e.g. resistant to Sybil attacks), or is it more a replacement for PAXOS or Raft?

I bring up the last point because these algorithms (e.g. Raft) require knowledge of the full quorum, whereas Swarm seems to work on a slice of the full quorum. If it is suitable for distributed use, how does it compare to constructions of FBA such as SCP?

gritzko · on Dec 28, 2021

Very good question. The article clearly lacks a section on that. This form of consensus is made for a different set of trade-offs than Raft or Paxos or blockchains. It implies a swarm of participants of an unknown size and, mainly, CRDT data structures. That subject deserves an article of its own though.

Swarm consensus does not seek quorums, it has no knowledge of the number of nodes. It assumes there is a way to connect to the "giant cluster" in a highly redundant way. On the other hand, if disconnected (blackout etc), islands of consensus can progress despite that. So it is more on the AP side, like everything CRDT.

colinmhayes · on Dec 28, 2021

Doesn't seek quorums, so it's not consistent in the face of partitions.

davidrusu · on Dec 28, 2021

> n ∈ N (p), where N (g) is the set of gnomes who can hear g, including himself. Note that h ∈ N (g) ⇐⇒ g ∈ N (h).

> Each gnome g tracks the spread of the proposal in the following way: once all his neighbors n ∈ N (g) say their k-neighborhood (or bigger) is aware of the proposal

If I understand this correctly, a single node crashing would stall it's entire neighbourhood. @gritzko, can you confirm?

gritzko · on Dec 28, 2021

Nope. Missed a heartbeat means gone. We discuss churn later in the article. Its effects are complicated, but as long as d holds, everything else holds.

davidrusu · on Dec 28, 2021

Ok, the red flag to me is the reliance on all neighbours responding to a proposal.

Suppose we have a faulty node heart-beating correctly but not responding to proposals, or selectively responding to proposals.

This would seem to imply that dishonest nodes have full control over which decisions are made.

gritzko · on Dec 28, 2021

see later in the article, Ctrl+F "mushrooms"

davidrusu · on Dec 28, 2021

I will stop looking into this now, I have to get back to my work. The colorful language doesn't make it easy for someone who follows the literature to review your idea.

I still have doubts that you can handle a dropped message or how much control a faulty node has over decisions, but I'll leave that to you to work out.

rileyphone · on Dec 28, 2021

Some background https://m.youtube.com/watch?v=sSYeqZYgSUk

crawfordcomeaux · on Dec 28, 2021

How might this look if implemented by a group of people, instead of a group of computers?