We’ve been talking for a while now about the refactored process for electing new section elders and managing splits. This week @anselme digs into the thinking behind the belt-and-braces approach to this.
General progress
@Chriso is refactoring the processes for signing chunks and registers and testing storage and DBC double spend.
@roland is working up tests for the sn_sdkg
AE and membership functionality described below.
Mostafa has been researching decentralised consensus mechanisms and is reviewing our current setup.
@davidrusu is reworking the join process to include network knowledge update before join.
And @anselme and @dirvine are refactoring the authority process as described last week.
@bzee, @bochaco and @joshuef are working on debugging lower level stability. We’ve seen a combination of slow message handling and bugs in connection management causing issues on occasion. So we’re taking a deep dive here. We’ve made some progress, simplifying how messages are handled and are keeping on on this front, as it seems like it’ll be quite a promising refactor, both in terms of code simplicity, but also stability.
All about sn_sdkg
DKG with sn_sdkg
- Based on poanetwork’s key generation in hbbft
- Removes timers
- Introduces Gossip
- Embraces concurrent DKG
- A race between DKGs to Handover
We just integrated sn_sdkg
(Safe Network synchronous distributed key generation),replacing the previous version where we saw failures and timeouts when networks were slow. The root cause is probably that we are using timers. This new DKG process does not use timers, in fact, we don’t want any timers on Safe, anywhere, at all.
DKG is not active, it’s just something that nodes do in the background when they receive system messages, and they can run multiple sessions concurrently.
As well as using anti-entropy (AE), sn_sdkg
also introduces gossip, to make sure nodes receive any messages they might be missing, so that everyone is eventually consistent.
Let’s walk through it.
Elders send DkgStart
- section split
- section handover
The first step in the process is the elders send a DkgStart
message. This happens when one of two situations arises. The first is when a node arrives in a section that’s older than one or more of the current elders. In this case we have to handover - the new node is promoted and section knowledge passed on. The second scenario is when the section gets too big and splits.
pub struct DkgSessionId {
/// Prefix of the session we are elder candidates for
pub prefix: Prefix,
/// Other Elders in this dkg session
pub elders: BTreeMap<XorName, SocketAddr>,
/// The length of the section chain main branch.
pub section_chain_len: u64,
/// The bootstrap members for the next Membership instance.
pub bootstrap_members: BTreeSet<NodeState>,
/// The membership generation this SAP was instantiated at
pub membership_gen: Generation,
}
When there’s a DkgStart
event, the elders generate information in a structure called DkgSessionId
(the format is called a ‘struct’ in Rust). They sign this struct with their BLS key share and exchange it with the other elders. Each DkgSessionId
provides baseline information about the current state of affairs and is unique to a DKG round.
The prefix is the current section prefix. In the case of a handover it doesn’t change. In the case of a split it does, and we add a one or a zero to our current prefix depending if we’re on the 0 or 1 side of the split.
The elders
field contains all the new candidate elders for this DKG session, nodes with a node age equal to or greater than that of a current elder.
section_chain_length
(to be renamed) is the distance from the current section key back to Genesis. We can think of it as the “section generation”.
membership_gen
also shows the membership generation. It is different from the section generation. Each section has many membership generations and the chain length does not change.
bootstrap_members
are the current elders in the section.
Once nodes receive DkgSessionId
s from all the current elders they can start a DKG session.
Let’s go!
Dkg Steps
flowchart LR;
A[DkgStart] --> B[Ephemeral Key]
All current and candidate elders receive a DkgStart
message that’s addressed to them and in which they are included as elders
in the DkgSessionId
struct.
Each of these nodes then generates an ephemeral one-time BLS key that’s only used for this DKG round. This ephemeral key is signed with their ed25519 key, which serves as their identity, so the other nodes can check if it’s valid.
Dkg Steps
flowchart LR;
A[DkgStart] --> B[Ephemeral Key] --> C[Voting]
Next comes the voting stage in which each node can generate their own keys for voting. We use a process borrowed from Poanetwork’s key generation in hbbft which ensures that each node can only generate its own key and not use information from other nodes to generate a separate key for the DKG round.
The nodes are voting on which among the candidates and existing elders should now be the section elders.
Dkg Steps
flowchart LR;
A[DkgStart] --> B[Ephemeral Key] --> C[Voting] --> D{Key Generation}
Once voting is terminated, the candidate generates a new key that it sends to the current elders to prove that it is eligible to become an elder (because a supermajority of nodes have signed it with the newly minted key). So DKG is a sort of test for a good candidate. If they don’t finish DKG, they are not good enough.
Concurrency
flowchart LR;
A[DkgStart1] --> B[Ephemeral Key] --> C[Voting] --> D{Key1 Generation}
flowchart LR;
A[DkgStart2] --> B[Ephemeral Key] --> C[Voting] --> D{Key2 Generation}
Multiple DKG sessions can happen at once. Every time there’s a new member in the section we might (if the new node is older than any of the current elders) have a new DKG session, so if while one session is ongoing a new node joins, another DKG session can start. Some will terminate, others will just time out and fail. It’s a race, and we pick the first one that finishes. Later, if it turns out that there’s an older node that’s not an elder, the process starts again.
Resolved by Handover
stateDiagram-v2
Dkg1 --> Handover
Handover --> Winner:Dkg1
Dkg2 --> Handover
Dkg3 --> Handover
If there is a tie, we only want one winner, i.e. one new elder. Handover is a process that uses consensus to decide which one should win.
Active AE
stateDiagram-v2
DkgStart --> EphemeralKey
EphemeralKey --> EphemeralKey: Includes DkgStart AE
EphemeralKey --> Voting
Voting --> Voting:Includes EphemeralKeys AE
Voting --> Termination
In these stages mentioned above, receiving a DkgStart
message, creating an ephemeral key and voting, we need to make up for network issues and slow nodes. For example, we need everyone’s keys to start voting, if not we get stuck. So we include AE in every key-related message to provide information that a node might be missing. AE is lightweight and an efficient way to make sure all nodes are up to speed.
enum SystemMsg {
DkgStart(DkgSessionId),
DkgEphemeralPubKey {
/// The identifier of the DKG session this message is for.
session_id: DkgSessionId,
/// Section authority for the DKG start message
section_auth: AuthorityProof<SectionAuthProof>,
/// The ephemeral bls key chosen by candidate
pub_key: BlsPublicKey,
/// The ed25519 signature of the candidate
sig: Signature,
},
DkgVotes {
/// The identifier of the DKG session this message is for.
session_id: DkgSessionId,
/// The ephemeral bls public keys used for this Dkg round
pub_keys: BTreeMap<XorName, (BlsPublicKey, Signature)>,
/// The DKG message.
votes: Vec<DkgSignedVote>,
},
DkgAE(DkgSessionId),
...
}
In the code, a System Message, which is the message type used for processes that might change the network structure, looks like the above. This encapsulates all the processes we’ve talked about.
Resilience
- Active AE
- Voting AE
- Gossip
- when waiting for votes
- when receiving outdated votes
To wrap up then, we have multiple mechanisms for resilience when handling handovers and splits.
Active AE means that nodes are sent information from the previous step in case they missed it.
We also have gossip as a back-up. If you’re in a DKG session, you’re ready to receive ephemeral keys or votes from somebody. If they don’t show up you send your current knowledge to the others in the hope they will update you. Also, if you receive out of date AE from someone, you can update them.
So gossip has two purposes. To make sure we are up-to-date so we can move forward, and to update others, so we can get to termination.
Together with AE and allowing for competition between DKG processes, this provides a much more robust process for promoting new elders and handling splits. During splits we have two DKGs going on rather than one.
Useful Links
Feel free to reply below with links to translations of this dev update and moderators will add them here:
Russian ;
German ;
Spanish ;
French;
Bulgarian
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!