We have simplified the code and eliminated many glitches by removing all unnecessary asynchronous multi-threaded processes in the node code. Some issues with slow communications remain, however, which we think may be caused by Quic itself, so we’re digging into that.
Outside of the bug bashing, the new synchronous sn_sdkg code is now ready for integration. @anselme walks us through that this week.
General progress
Qi_ma is looking at how elders check a node’s node_age when they are deciding which to relocate to a new section, as we are seeing some anomalous behaviour there, including ‘zombie’ nodes being able to join the network.
@ChrisO is experimenting with Quinn, which is a Rust implementation of the Quic protocol for establishing and maintaining connections between peers. Unfortunately, Quic is something of a black box, and we think some of the connectivity issues we are experiencing may be down to the way it works, specifically that communications are often slow, which causes problems when processes time out. Chris has a Quinn sandbox set up and is seeing what happens when we fire different types of messages at it. At the same time, @bzee is looking at the structure of the qp2p communications to confirm we are using Quinn as efficiently as possible. The key issue is receiving concurrent streams asynchronously while allowing waits on responses (returning a watcher, to watch for response messages on the same channel).
@roland is working on fuzz tests for the new sn_sdkg crate, which @anselme describes in detail below.
sn_sdkg Integration
Distributed Key Generation (aka DKG) is the way section elders generate the section key securely while keeping it secret. At the end of DKG, each elder knows only their own secret key share, so nobody ever sees the entire section secret key. This limits what potentially bad elders can do: it’s how Safe Network can ensure that as long as fewer than 5 out of 7 elders are bad, they can’t sign anything with section authority. Section authority is required to change data, promote or demote nodes, and mint tokens, so this is very important.
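To make the 5-out-of-7 arithmetic concrete, here is a minimal sketch. It assumes a section of 7 elders and a signing threshold of 5 shares, as described above. The constant and function names are hypothetical and are not part of any Safe Network crate, and no real cryptography is involved.

```rust
use std::collections::BTreeSet;

/// Assumed number of elders in a section (from the "5 out of 7" figure above).
pub const ELDER_COUNT: usize = 7;
/// Assumed number of distinct key shares needed to sign with section authority.
pub const SIGNING_THRESHOLD: usize = 5;

/// Returns true if the given elder indices hold enough distinct key shares
/// to produce a section signature. Duplicate and out-of-range indices are ignored.
pub fn can_sign(share_indices: &[usize]) -> bool {
    let distinct: BTreeSet<usize> = share_indices
        .iter()
        .copied()
        .filter(|index| *index < ELDER_COUNT)
        .collect();
    distinct.len() >= SIGNING_THRESHOLD
}
```

With these assumptions, four colluding elders cannot sign, however many times they reuse their own shares, while any five distinct elders can.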
Recently, we’ve been working on a new DKG that is more resilient to packet loss and that doesn’t use timers, so it can’t fail because of slow network traffic and timeouts. This post describes how this new DKG works. For this implementation, we use the synchronous distributed key generation sn_sdkg crate, which is based on poanetwork’s Synchronous Key Generation algorithm in their hbbft crate.
How DKG works
DKG is triggered by the elders when they notice that the oldest members are not the elders, or when a section splits and they need to choose elder candidates. When this happens, the current elders ask the candidates to start a new DKG session with a DkgStart message, so they can generate the next section key.
The first step in our DKG is generating temporary BLS keys, which are used for encryption in the DKG process. Every node on the Safe Network has an ed25519 key, but although those keys are great for signatures, we can’t safely do encryption with them. We need another way.
Since our nodes don’t have BLS keys (elders have a BLS keyshare but not a simple BLS key), we generate a one-time key just for this DKG session and discard it after use. However, we need the other nodes to trust this BLS key because it’s brand new, so before anything else happens, each candidate broadcasts its newly generated BLS public key in a message that contains their signature (made with their trusted ed25519 key) over the new one-time BLS public key that they will use for this DKG session.
Once the candidates have all the BLS public keys for this DKG session, they can start voting. Voting has 3 stages:
- Parts: every node submits a Part that will be used for the final key generation. It contains encrypted data that will be used for generating their key share.
- Acks: nodes check the Parts and submit their Acks (acknowledgements) over the Parts. These Acks will also be used for the key generation.
- AllAcks: everyone makes sure that they all have the same set of Acks and Parts by sending their version of the sets. This last stage is there to make sure that the candidates end up generating the same section key!
Once voting is finished, candidates can generate their secret key shares along with the new section public key from the Parts and Acks.
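The three voting stages can be pictured as a small state machine. The sketch below is a simplified illustration and not the sn_sdkg API: Vote, Stage, Session and NodeId are hypothetical names, each node is assumed to cast exactly one vote per stage, and the votes carry no cryptographic payload.

```rust
use std::collections::BTreeSet;

/// Hypothetical participant identifier (index into the candidate list).
pub type NodeId = usize;

/// One vote per stage and per node; real votes would carry signed DKG data.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Vote {
    Part(NodeId),
    Ack(NodeId),
    AllAcks(NodeId),
}

impl Vote {
    fn sender(&self) -> NodeId {
        match *self {
            Vote::Part(id) | Vote::Ack(id) | Vote::AllAcks(id) => id,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Parts,
    Acks,
    AllAcks,
    Terminated,
}

pub struct Session {
    participants: usize,
    votes: BTreeSet<Vote>,
}

impl Session {
    pub fn new(participants: usize) -> Self {
        Session { participants, votes: BTreeSet::new() }
    }

    /// Records a vote. Returns true only if the sender is a known
    /// participant and the vote had not been seen before.
    pub fn handle(&mut self, vote: Vote) -> bool {
        vote.sender() < self.participants && self.votes.insert(vote)
    }

    fn count(&self, pred: fn(&Vote) -> bool) -> usize {
        self.votes.iter().filter(|v| pred(*v)).count()
    }

    /// The stage is derived from the votes seen so far, with no timers:
    /// a stage completes once every participant has voted in it.
    pub fn stage(&self) -> Stage {
        let n = self.participants;
        if self.count(|v| matches!(v, Vote::Part(_))) < n {
            Stage::Parts
        } else if self.count(|v| matches!(v, Vote::Ack(_))) < n {
            Stage::Acks
        } else if self.count(|v| matches!(v, Vote::AllAcks(_))) < n {
            Stage::AllAcks
        } else {
            Stage::Terminated
        }
    }
}
```

Because the stage is computed purely from the set of votes received, votes arriving late, duplicated or out of order do not break the session. They are stored and counted once the earlier stages complete.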
Gossip and eventual termination
On a network, messages can be lost and that can lead to situations where some candidates are missing votes and some are waiting for a response to votes that never arrived. To counter this problem, we have gossip! Every now and then if a node hasn’t received any new DKG messages when it is expecting some, it will send out all its votes to the others. This has two purposes:
- one is to inform the others of the votes, and get them up to speed with votes they might have missed
- the other is to show the other participants that our node is missing votes, so others can respond in turn with their votes and help us catch up with them
Indeed, if a node receives a gossip message that is missing information, it will respond with its knowledge. This happens even after termination (completion of the voting round), because sometimes, when a node terminates (and thus stops gossiping because it is not expecting any more votes), it will still receive gossip from other nodes that didn’t make it there yet. In that case, the knowledgeable node will respond to this gossip with their knowledge so the other nodes can also reach termination. Eventually, through this process, every candidate reaches termination.
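The catch-up rule described above can be sketched with plain sets. This is an illustration under simplifying assumptions: Node, VoteId and handle_gossip are hypothetical names, votes are reduced to bare identifiers, and a reply always carries the responder's full vote set.

```rust
use std::collections::BTreeSet;

/// Hypothetical vote identifier; a real vote would carry signed DKG data.
pub type VoteId = u32;

#[derive(Default)]
pub struct Node {
    pub votes: BTreeSet<VoteId>,
}

impl Node {
    /// Handles an incoming gossip message containing the sender's votes.
    ///
    /// We first learn any votes we were missing. If the sender's message
    /// lacked votes that we already knew, we return our full set so the
    /// sender can catch up; otherwise no reply is needed.
    pub fn handle_gossip(&mut self, incoming: &BTreeSet<VoteId>) -> Option<BTreeSet<VoteId>> {
        let sender_is_behind = self.votes.difference(incoming).next().is_some();
        self.votes.extend(incoming.iter().copied());
        if sender_is_behind {
            Some(self.votes.clone())
        } else {
            None
        }
    }
}
```

Note that nothing here depends on whether the receiving node has already terminated: a node that has finished still answers gossip that is missing information, which is what lets slower nodes reach termination too.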
Concurrent DKGs
In this implementation, we embrace concurrent DKGs. Sometimes, right after DKG is triggered, a new node joins the section and appears to be a better elder candidate because its node age is very high. In this case, the current set of best elder candidates changes, and the current elders issue another DkgStart message to the new candidates.
The previous DKG session is not stopped. Instead, it’s now a race between the two! We want elders to be very reliable nodes. In a way, the intensive DKG process is a test to check that these candidates are indeed fit to be elders. If multiple DKGs terminate at the same time, that’s fine: Handover Consensus will make sure the current elders pick just one winner. DKG sessions that didn’t win the race might or might not terminate, but it doesn’t really matter, as they will eventually be stripped out when the nodes realise that they lost.
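One way to picture the bookkeeping for concurrent sessions is a map keyed by candidate set. This is only a sketch: DkgSessions, SessionId, Status and the method names are hypothetical, the winner is assumed to be chosen elsewhere (by Handover Consensus, which is not modelled here), and all sessions are simply dropped once a handover has been decided.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical node identifier.
pub type NodeId = u32;
/// A session is identified here by its set of candidates (an assumption).
pub type SessionId = BTreeSet<NodeId>;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Running,
    Terminated,
}

#[derive(Default)]
pub struct DkgSessions {
    sessions: BTreeMap<SessionId, Status>,
}

impl DkgSessions {
    /// Handles a DkgStart for a candidate set. Existing sessions are left
    /// running, so several sessions can race each other.
    pub fn start(&mut self, candidates: SessionId) {
        self.sessions.entry(candidates).or_insert(Status::Running);
    }

    /// Marks a session as having completed its voting round.
    pub fn terminate(&mut self, candidates: &SessionId) {
        if let Some(status) = self.sessions.get_mut(candidates) {
            *status = Status::Terminated;
        }
    }

    /// Called once a winner has been decided elsewhere: every tracked
    /// session, finished or not, is stripped out.
    pub fn handover_decided(&mut self) {
        self.sessions.clear();
    }

    /// Number of sessions still running.
    pub fn running(&self) -> usize {
        self.sessions.values().filter(|s| **s == Status::Running).count()
    }

    /// Number of sessions currently tracked, running or terminated.
    pub fn tracked(&self) -> usize {
        self.sessions.len()
    }
}
```

The point of the design is that starting a new session never cancels an older one, and losing sessions need no special shutdown path; they are just discarded after the handover.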
Conclusion
In short, the new DKG focuses on being very resilient to message loss, removing the need for timers and making sure everyone eventually reaches termination without any possibility of timeouts. It also turns concurrent DKGs into a feature: the best candidates are selected through a race to termination between DKGs.
Useful Links
Feel free to reply below with links to translations of this dev update and moderators will add them here:
Russian; German; Spanish; French; Bulgarian
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!