GoSig BFT Consensus Protocol for Blockchain Simplified
Principal Software Engineer - 01 October 2018 -
Principal Software Engineer - 01 October 2018 -
Any blockchain must organize the transactions in a linear order of the time of their execution, which may be different from their time of submission. Instead of trying to achieve consensus on individual transactions, it is more efficient to group them in blocks which are a finite ordered list of transactions. If a block is accepted as valid and a consensus has been achieved, that means the same thing as achieving a consensus on all of the included transactions in the mentioned order. Blocks are ordered in a sequence which is aptly named as a block-chain or simply blockchain. The chain has a beginning called the genesis block. Everyone starts off with the belief that the genesis block is valid.
The distance of any block from the genesis block in term of the number of blocks in-between is called the height of the block. The height of the genesis block can be set as zero. The height of the first block after that is one, and so on. The job of the blockchain protocol is to make sure that every honest party agrees on the block that is acceptable given any particular height.
In a BFT protocol, each participating party sends messages to other parties in a prescribed manner until they come to a conclusion about a certain candidate block. There is the requirement that once an honest party is able to decide a certain transaction to be valid and agreed upon, no other honest party would deny that. Each honest party confirms a consensus by making sure that a large majority of other parties (honest or dishonest) agree to the same block for that particular height. The consensus algorithm is run fresh for each height, one after another.
Each party needs to know the public key of the other parties to validate their signatures. Otherwise, one single dishonest party can send the P-messages on behalf of everyone to create a consensus of his choice.
Blockchain BFT protocols provide a better way of achieving block consensus than the proof of work used in Bitcoin.
A theorem from Fisher, Lynch and Paterson rules out any possibility of achieving a consensus with even one faulty process in a completely asynchronous setting. So, in order to have a consensus, we assume some form of weak synchrony. Weak synchrony implies that the processes have almost synchronized time and the network is connected for some time during which the messages are guaranteed to be delivered within a specified amount of time.
BFT protocols have the following advantages over proof of work algorithm used in Bitcoin –
The only drawback is that the chain must know all of the miner’s public keys. This can be resolved by doing a Proof of Stake algorithm along with the consensus. Currently out of the scope of this blog.
GoSig is a BFT consensus protocol that provides security even in the public network.
In this blog, I will try to develop the protocol in steps to explain the concepts behind it.
As discussed in the previous section, the GoSig protocol must satisfy the 2 properties of consensus i.e Safety and Liveness.
Step 1 – A first cut BFT algorithm:
We assume that 2 third users are honest. Every user is able to make a choice somehow or the other. There is a possibility of not all choices are visible to every honest user when the network is fragmented or due to delay in the network.
Based on the above assumptions, every honest user should automatically agree on a choice as long as the network is connected.
Let’s look at what happens when the network is broken. For every height, any honest party does the following.
When the honest user receives 2f+1 votes from other parties for any particular choice for the block, he commits this block as the chosen one.
Let us see if the above algorithm satisfies the ‘safety’ property. If any block receives 2f+1 P messages, at least f+1 of them must be from honest parties, because the maximum number of dishonest parties are the only f. These honest parties will never send P message for a different block. So the maximum number of P messages that can be produced for any other block must be (3f+1) – (f+1) = 2f which is not sufficient for a commit. So, if some block is committed for a certain height by an honest party, no other block can be committed by any honest party. So the first condition is satisfied. This is great!
However, it might happen that the honest parties accidentally chose more than one different candidates for the next block. For example f honest users may P message candidate_1 and f+1 users may end up P messaging candidate_2. The attacker simply does not do anything and enjoys what’s going on the system. In this case, no choice of the block has sufficient P messages for commit, so the system is stuck. The second property of liveness has been violated even when the network is connected.
Step 2 – Network Deadlock
We need some way to get out of the above deadlock situation. The obvious idea is of course to restart the whole protocol over whenever there is such kind of a deadlock. The problem with this is its impossible to know when a deadlock has happened from just incoming P-messages.
Suppose we conclude that a deadlock has happened after receiving some s P-messages from different parties and the party still fails to commit. Now, let’s say in some case f honest users may P message candidate_1 and f+1 users may end up P messaging candidate_2. This is obviously a deadlock and hence these 2f+1 messages must be able to detect it and hence s ≤ 2f+1.
Suppose, now in a different situation 2f (group A) honest users have P-messaged candidate_1 and 1 (group B) honest user did not P-message anything yet (because he is yet to receive the possible candidates). Now, the attacker can send a P-message candidate_2 to group A which would convince the group A to abort (it has 2f+1 P-messages and still cannot commit, and s ≤ 2f+1). After group A aborts this and restarts the protocol, suppose they all decide candidate_2 (maybe because they have come to know of candidate_2 now). So group A now has 2f P-messages for candidate_2. The attacker now provides 1 P-message for candidate_2 so that group A can now commit candidate_2.
Notice that all this time, group B has not been able to either abort or commit any block. So, now the attacker can replay the recording of the first round for group A (showing evidence of 2f P-messages from group A for candidate_1 from his saved messages collected from the round) and adds his own 1 P-message for candidate_1. This provides 2f+1 messages for candidate_1 to group B and hence group B commits candidate_1. In effect, group A committed candiate_2 and group B committed candidate_1 for the same height. This breaks our condition 1.
The easiest solution to this problem is to timeout around after some reasonable time if a commit cannot be achieved before that. As long as the clocks of each processor are more or less same, this should make sure everyone is in the same retry round. This assumption is a synchrony assumption called partial synchrony. We will go into more details about it later. For the time being, let us just assume that all nodes maintain the exact same time and message propagation time is zero for any message.
The next proposed solution is very simple.
It solves our liveness problem, but the safety property is broken now. Suppose 2f+1 parties send P-messages for candidate_1. The messages reach only 1 honest party who commits candiate_1. The rest of the user waits until R time has passed and restarts the process where they can easily commit candidate_2 with one P-message from the attacker.
Step 3 – Adding tentative commit of TC messages:
To avoid both of the problems, we need to put one extra step of messages that we would call the tentative commit or TC messages. The idea is to separate P-messaging for a different block in the new round and actually committing a block in the previous round. Sometimes a user needs to observe the new round while he still did not decide anything for his current choice for the candidate block. Every honest user does the following in every round.
This was much more complicated than the previous one. This is mostly because it has a new step and lots of nuances. Let’s see how this satisfies the safety and the liveness properties.
Safety property: We need to prove that if any honest party commits a block, there would be no more commits at the same height. This ensures that there is only one possible commit for any height.
When an honest party commits a block, he must have observed 2f+1 TC messages for the same block. Since the attacker can only send f of them, f+1 TC messages must have come from honest users. These honest users will never P-vote for the same height for a different block. So, any further rounds can only have a maximum of 2f P-votes for any other block which would not even be sufficient for a TC message. Without any TC message for any other block, no honest user will commit any other block at any further round, proving our case. Two different commits cannot, of course, happen at the same block because of the fact that it is not possible to make TC message for two different candidates at the same height. If basically proves that even though honest parties can commit a block for the same height in different rounds, they would always commit the same block.
Liveness property: We need to make sure that no matter what happens in the network and message order and whatever the messages sent by the attacker, when the network recovers, all honest users will commit some block. We break it into the following cases.
Now that we have a way to achieve consensus, I would delve into how block proposals are created by any party.
Creating the block proposals:
Given that blocks can be proposed by anyone, we need to handle the following situations
To tackle the first problem, we allow one party to suggest only one candidate per round, and that too only after a commit happened for a previous height. So, to suggest a candidate, any user has to first prove that there was a commit that happened in the previous round. Since a candidate will eventually but definitely get committed after 2f+1 TC-messages for the same has been broadcasted, we require that those TC-messages be presented while suggesting a new block for the next height. So, any valid new candidate suggestion for the height h must contain 2f+1 TC-messages.
To tackle the second problem, a small set of leaders are selected at random. This ensures that less number of blocks are proposed, thereby increasing the probability of commit. The random selection must be publicly verifiable to be random, otherwise, the attacker will simply choose random numbers that favour him and simply claim that he got those numbers randomly.
One way to get a publicly verifiable random number for any party would be to simply concatenate current height, current round and the party’s public key and take the hash of it. Since cryptographic hashes have uniform random distribution, this makes a good source of the random number. We say a party has been selected as a leader if his random number for the round is less than a certain value which is chosen based on the fraction of parties to be chosen as leaders per round on average. If the maximum value of a hash is N and the fraction of parties chosen as a leader be x, then the condition for selection would be H<x/N, where H is the random number for the given party (H=Hash(height||round||publickey)).
Although the above idea works, Jing Chen et. al., the creators of the Algorand algorithm had a better idea. They used a technique called a verifiable random number that has all the properties of what we had suggested, and it cannot be known to any other party before the owner of the private key discloses it. Instead of taking the hash of the concatenation of the height, round and public key, they start with the hash of the concatenation of height and round. They then use a unique signature algorithm to sign this hash using the party’s private key. A unique signature has no random element and there is only one valid signature for a single message. RSA is one such signature algorithm. Another is quite a new one called the BLS signature. They then sign the hash using the party’s private key. The random number H=Hash(S) where S=Signpry(Hash(height||round)) . The signature S is included with the random number as a proof so that it is publicly verifiable that the hash is correctly generated and the generator has no way to control it. Since the signature can only be generated by the owner of the private key, other parties cannot know about it until he discloses it.
The leader must include the signature as part of his candidate suggestion. This makes sure that the leader cannot be identified before he already made the suggestion. That way, he cannot be corrupted by the attacker beforehand.
To increase the likelihood of every honest user P-messaging the same candidate, we need to find a deterministic way to choose between multiple candidates. A simple rule can be that we choose the one with the minimum value of the verifiable random number in the first suggestion.
Now that we have a working BFT algorithm, I would like to dive into the details of network delay and clock delay.
Handling delays and timing tolerance:
Let T be the maximum difference between any two parties, S be the maximum network delay for receiving a message when the network is connected, R is the time for each round, M be the maximum processing time for each round. We consider the following under full network connectivity.
So in total, R must be greater than T+3S+M. This will always ensure a commit under full network connectivity while allowing the system to work fast.
The correct protocol I have described here is the Gosig protocol proposed in the paper https://arxiv.org/pdf/1802.01315.pdf by Li, Wang and Chen (I changed a few little details, but they don’t matter). We went step by step into developing a working BFT protocol that can be used in a blockchain. There are quite a few others with somewhat different approaches, we do not cover them here.