Elders use Distributed Key Generation (DKG) to decide between two possible alternative courses of action. Each elder generates a key-share and sends it with its vote, and if a supermajority (5 out of 7) of these key-shares are received they are aggregated to create the full key required to validate the decision. We developed our own system to do this using BLS cryptography, which works fine most of the time - but not all of the time. Unfortunately when it fails the network can grind to a halt. This week @anselme provides a beginners’ guide to DKG, and he and @davidrusu explain how we are looking to make our DKG implementation more robust.
General Progress
@davidrusu and @joshuef have been digging further into making sn_node single-threaded as far as possible to get rid of the problem caused by locking. This will require a little refactoring, moving Comm - the code that handles messaging between nodes - out of sn_node
so it can stand on its own better and potentially have its own thread.
Meanwhile, @bochaco has been working on the minting of the initial genesis DBC by the first network node, the handling of spentbooks so that nodes don’t need to confer with other sections to see if a transaction is valid, and other steps in DBC integration.
@bzee is getting stuck into information dispersal algorithms (IDA), which, as you may remember, is how we are reducing the storage and bandwidth requirements of the network, and @roland is working on some write module tests for DKG and assisting @Anselme with the new sn_sdkg
crate (s
here meaning synchronous
).
Towards a more stable DKG for Safe
Distributed Key Generation (DKG) is necessary for the Safe Network to ensure that each elder in a section has a unique shard of the section key: the key-share. It allows the network to generate a secret key for each section, without ever having a single node know the entire section key. This makes the Safe Network more resilient to Byzantine nodes as it doesn’t need to trust all the elders.
The current DKG implementation has a flaw that’s been bugging at us for many months: The use of timers. DKG requires full participation from all nodes who wish to be able to sign. In times of high network activity, nodes can become backed up and fail to participate in DKG sessions in a timely manner.
With our current implementation, if nodes take too long to contribute to DKG, then the DKG session will fail. We’ve been keen to remove any time-related behaviours in Safe Network, embracing concurrent and asynchronous flows as a default.
We’ve been looking for an alternative to timer based DKG for a while now and we think we finally found something that will work by building off of the Membership and Handover work we’ve done.
DKG for the rest of us
Although DKG can seem like a very technical and complex thing, I feel like its usefulness and basic mechanisms can be explained to a 6-year-old.
Imagine a kingdom with a king. The king has an ivory stamp that no one can fake. Any letter with the king’s seal of approval is seen by the whole kingdom as trustworthy.
ivory stamp == secret key
seal of approval on a letter == letter with the king's signature
One day the king realises he is old. The king has four heirs, and he wants to be fair, so he asks his master craftsman to break his ivory stamp into four equal pieces that can each print out a part of the king’s seal. He then gives out one of these tiny stamps to his four children. Now in order to recreate the king’s seal of approval, the children need to use their tiny ivory stamps and each print out their seal parts next to each other until one can see the full seal again.
4 ivory stamp pieces == multisig
tiny stamp == key-share
After a while the heirs realise this mechanism is slowing the kingdom down because requiring all the seal pieces every time is a tedious operation, some of them are away and travelling at times, so getting all the seal parts can get pretty complicated. So they have an idea: they agree that as long as one can see more than half of the king’s seal on a letter (3/4+ seal), the seal is considered to be valid and carry the kingdom’s authority.
3/4+ seal == threshold cryptography
One day the heirs realise the master craftsman was dishonest and kept a copy of the king’s original ivory stamp and that he was selling them throughout the kingdom to some evil doers. The heirs then decide to create a new stamp for their kingdom. But in order to avoid last time’s mistake, they each hire a craftsman to create a single new piece of a new stamp. Each heir has their own seal that they created on their side. By joining them together they obtain a new royal seal that is unique and that no single stamp can fake. They decree that the union of more than half of these seals represents the kingdom’s authority. Since no one ever saw the full stamp, they can be sure that no one will cheat this time.
separately created stamps that make up a unique seal == DKG
In the Safe Network, DKG is used by elders to generate the section key that carries section authority. In SN terminology, the genesis key (key of the first node in the network) is like the king’s stamp. Once more members join, the genesis node and new nodes perform DKG to create a new section key. They each create their own key-share (tiny stamps), but they have never seen each other’s keys, so no one ever has the entire key in his hands. Every time there is a change of elders (the heirs), the nodes perform DKG to create a new section key. This is pretty safe because one would need to control a supermajority of the elder nodes in order to have section authority (more than half of the kingdom’s seal).
Poanetwork’s DKG
Poanetwork’s hbbft (Honey Badger Byzantine Fault Tolerant consensus algorithm) has a rather straightforward implementation of DKG that does not make use of timers. Their code has been audited by some knowledgeable crypto people, so that’s also a big advantage over ours.
HBBFT’s key generation algorithm in a nutshell:
- Participants each generate
Parts
that they broadcast to the others - Participants check all the
Parts
and generateAcks
(acknowledgements) for each of them, also broadcasted to the others - Finally, they can each generate their own keyshare from the
Parts
andAcks
There is a catch though! They all need to process the same sets of Parts
and Acks
!
From their docs:
If Alice’s Ack was handled by Bob but not by Carol, Bob and Carol could receive different public key sets, and secret key shares that don’t match.
One way to ensure this is to commit the messages to a public ledger before handling them.
Although there are no timers involved, this algorithm does require nodes to wait for each other.
Integration with Safe
We need to ensure that all participants have access to the same sets of Parts
and Acks
. Poanetwork’s docs suggest using a distributed ledger like a blockchain to ensure that all participants have the same sets. We don’t have a blockchain in Safe - but that’s fine. Our team had an idea with broadcast signed messages that doesn’t require full consensus! It works as follows:
- Send
Parts
in a signed message - Wait until all 7
Parts
come in, broadcast the signed set - Wait until all 7 identical signed sets of
Parts
come in - Send
Acks
- Wait until all 7
Acks
come in, broadcast the signed set - Wait until all 7 identical signed sets of
Acks
come in - Once a node receive 7 signed sets with the same
Acks
, they can Generate the Key Share from the agreed onParts
andAcks
The fact that we have that additional round of all-to-all broadcast to get all nodes to see that all nodes have seen the same Parts
and Acks
makes this algorithm resilient to a certain amount of Byzantine nodes. It is also easy to prove that a node cheated by showing two different and contradictory messages that they signed.
Although this simple algorithm makes sure nodes agree on the same Parts
and Acks
, it does not guarantee termination in case of a certain amount of message loss. That’s fine, we embrace it and don’t see it as a failure.
If enough messages are lost, we assume that this set of candidates was not fit to become elders in the first place. At every churn event (node joining or leaving a section), the elders check if the oldest 7 members in the section are the current elders. If that is not the case, they ask the oldest members to start DKG. This process is called Handover.
Churn events can happen at quite a fast pace, so multiple concurrent DKGs are possible. It is a race between multiple DKGs, and whichever finishes first wins the right to go through the handover process. Many DKGs might be left behind either not finished yet or stuck due to message loss, but that’s fine, it’s part of the game. Each node keeps their current DKG statuses in a temporary set that is reset at every Handover.
This algorithm makes DKG a passive process: it is only message handling. There are no timers, no fetching data on other nodes, just a temporary set of ongoing DKGs waiting to be completed or discarded after the next handover.
If for some reason several messages are dropped and a DKG cannot be terminated this is easily handled. We can leave it to Dysfunction to eventually track down inactive/slow nodes and vote them off. This in turn triggers churn, and that churn triggers a new DKG round.
Links
Useful Links
Feel free to reply below with links to translations of this dev update and moderators will add them here:
Russian ; German ; Spanish ; French; Bulgarian
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!