Data Chains Summary
There was an immense amount of information contained in the latest Dev Update about Data Chains. The linked documentation contains a lot of implementation detail, and I felt a higher level overview would be useful.
These are the original sources
- Dev Update 2017-05-25
- Data Chains - General rules in eventually consistent network
- Data Chains - Option A
- Data Chains - Option B
Goals
-
“improve the robustness and security of the SAFE network”
-
to achieve “an eventually consistency network” using group consensus
-
have a consistent network but still (necessarily) “handle out of order messages” for each individual node on the network.
-
“make sure that nodes agree 1) which changes to the section’s membership happened and 2) in which order they happened.”
-
resolve any differences between “The truth we see” and “The truth the group agrees”
-
To solve the problem that “the existing nodes in the network don’t always agree on which events [affecting the group] have happened yet or should happen”
New Benefits
Data chains bring some new benefits:
-
“ensure that all correctly-functioning nodes eventually agree on the nodes in their section”
-
“an external observer can use the chain to verify that a section is authentic”
- When a client disconnects from the network and reconnects at a later time, the network it connected to is verifiably the same as the one it had been connected to before.
- “A chain with missing events will be detectable in cases where multiple copies are requested from different current group members.”
-
“provide cryptographic proof of what data is stored on the network”
-
provide cryptographic proof of “who the legitimate members of the network are”
-
“no node needs to know about the entire network or chain” but the entire network can still retain integrity.
-
“On network restart we can accept a chain as containing valid information from a node”
How
-
Record changes to the structure of nodes in the network in a sequential order (ie a chain of events), so the history of the network structure can be known accurately. “The blocks are linked to each other by every block containing a hash of its parent block, thus forming a chain that can be cryptographically verified by any node.”
-
The priorities (in order of most to least important) are to protect the network, protect your group, protect yourself. “These basic tenants will mean a node will self terminate to protect the group, or destroy the group to protect the network.”
Chain Storage
-
“this first part does not yet concern data and focuses on the changes of membership and the network’s structure only”. Possible events that may be included in data chains are:
- Genesis, the initial network state [no parent]
- Add, recording the addition of a node
- Remove, recording the removal of a node
- BeginMerge, recording the final state of a section before merge
- CompleteMerge, recording the (re)creation of a new section [two parents]
- BeginSplit, recording the final state of a section before split [allows two child blocks]
- CompleteSplit, recording the (re)creation of a new section
- BeginForceMerge see Handling Broken Sections below
- CompleteForceMerge [two parents]
-
“Routing nodes must store [a] subset of the chain in their cache” - this increases the resources required to run a routing node (compared with not using data chains), however “crude analysis suggests the storage size is not prohibitive, though transfer size on startup likely would be”
-
The amount of recent chain history to be stored by each node “needs to be large enough to make the probability of nodes falling behind by more than that number of blocks negligible, without creating disproportional overhead. For now, we estimate that 20 should work well.”
-
“Where prioritisation of multiple valid blocks is required [at the same time], blocks are ordered … from most urgent to least urgent with the goal of stabilising the network as soon as possible”. The order of events from most to least urgent is BeginMerge, Remove, Add, BeginSplit.
-
“blocks are proposed by a particular member called the primary, and are only committed to the chain if agreed on by a quorum of members”. This means the section nodes all have a consistent view of the order of events, regardless of the order each individual event arrives at each individual node. “Until the group (or specifically) quorum of group members agree with their vote then the event is not agreed.” However it is also said that “any node can make any proposal [for a new block] at any time and send it to all group members” so there seems to be some mild conflict here between proposed design and implementation.
-
Nearby sections are also monitored by the data chain - “Blocks from neighbouring sections are added to our chain if valid in the same way as those from our own section, and if valid trigger updates to the corresponding sections in our routing table.”
- “It is not essential to ensure that section updates reach every neighbouring node… the processes are designed to tolerate small discrepancies in the version of neighbouring sections.”
- “Other updates from neighbouring sections (to add, remove or split) could trigger a split or merge in our own section, and may cause us to send connection requests or disconnect from nodes, but otherwise do not directly affect our section.”
Error Handling
- “Several constants are needed for broken section handling”
- “Determining optimal values for these parameters will be done experimentally.”
- If good health is not indicated by a neighbouring section within the specific heartbeat period, that neighbour is considered to be broken.
- “we establish which nodes of [the broken section] are still reachable” and attempt to merge with those reachable nodes to form a new section.
Attacks
-
“In order to commit a fraudulent BeginForceMerge, the section p1 must have enough malicious nodes to satisfy quorum, thus these nodes can commit a CompleteForceMerge with any list of “still reachable” nodes they want. Since the compromised section can commit any Add block in the same way, an attacker with a single compromised section could add enough malicious nodes to satisfy the workload and use force merges to take over neighbouring sections, dropping enough other nodes to maintain a compromised quorum.”
-
“Security of the network depends on each section behaving correctly, which is ensured by the network distributing nodes throughout the network automatically, such that an attacker cannot choose where in the network any nodes he runs will join, and a brute force attack would require running a significant proportion of the nodes in the network to have any real chance of success… The broken section recovery mechanism could be used as a weapon allowing a section controlled by an attacker to take over more of the network, even up to the point of taking over the entire network if the attacker had enough resources to handle network load; this is a known limitation, but necessarily any mechanism to recover from a state where no consensus is possible must reduce security. This does not necessarily mean the system is less secure than a network using a less strong form of consensus, and is only an exploit if an attacker can control one section in the first place.”
-
“the extra message handling required to establish consensus on each new block may add some delay. This is not seen as a major concern because… the work involved in handling client requests is expected to be far greater than the work involved in managing churn”
-
“As the vote is signed then nodes who are acting maliciously have little effect, however their actions are recorded by default and these actions are digitally signed. Therefor bad or invalid behaviour is detectable and cryptographically provable. This detection and any punishment is not discussed in this paper.”
-
“A critical question is what to do with misbehaving nodes. These are nodes that do not vote, or frequently propose (vote) for events that we do not agree with. To maintain integrity bad nodes will require to be punished and possibly removed from a group. If the punishment (such as Kill the node or immediately relocate with age/2) is too harsh it would de-stabilise the group quickly, too lenient and it will lead to too high an error rate and affect our decision making. To penalise or at least handle a mis-behaving node, we need to measure it against an agreed behaviour. As we know nodes will recieve messages and events out of order we do need to handle the case where a node is trying to keep up, but maybe slightly slow. This slow node may be valuable, but if too slow, then it’s a danger.”
Difference Between Option A and B
I found it hard to differentiate the features of A vs B, let alone the pros or cons. Anyone who can shed more light on this would be appreciated.
-
“Option B proposes a mechanism that computes the validity of blocks from a set of votes that is independent of the votes’ order.”
-
In Option B “more than one state could be valid at the same time”. Events are proposed and gradually voted on, until they go from being ‘proposed’ to ‘confirmed’ (although the terminology used in the document is to go from being ‘current’ to ‘accumulated’).
-
Presumably the above points differ from Option A in that for Option A only one block may be proposed and voted at a time, making it a more sequential process. Some additional clarification on this would be welcome.
Personal Observations
-
I don’t understand how the entire chain of blocks is stored. The use of MIN_CHAIN_DEPTH seems to indicate that most nodes will store a subset of the chain that does not extend back to the genesis block. Some clarification here may be helpful to me.
-
Initially the use of the word “block” seemed misleading to me. “A block is an element in the chain, describing one change.” - why not call it a change or an event? That’s what it is after all. Later in the documents a block is defined as “an agreement of several nodes. This agreement is digitally signed by each node.” This makes more sense to be labelled as a block (ie a block of signatures). I feel the use of the word ‘block’ too easily invites comparison with blockchains, which is a great disservice to the technology of data chains. I would suggest the word ‘change’ or ‘event’ or ‘delta’ may be more appropriate. A minor point for sure.
-
I’m not convinced the work involved in handling client requests will be “far greater” than the work involved in managing churn. Churn is an obvious attack vector and I suspect churn efficiency will be a major factor in network security. This is just a gut feeling of course, and I concede the benefit gained by such an attack would be so small it would probably be infeasible.
-
The language used in Option B is much more technical and detailed than the other two documents. It was a very difficult read for me. I think Option B is a more natural design for the voting mechanism (using a ‘frontier’ of possible chain extensions rather than a strictly serial / sequential vote system via a Primary node). But I think this is also much more complex to design and develop. So my preference is to implement the simplest thing that works (Option A) with a mind to be able to improve it later (turn Option A into Option B via a series of future optimisations). Maybe this indicates I’ve misunderstood the difference between the two options!
The documentation represents an enormous amount of work and is extremely impressive. A very heartfelt congratulations to the people who created it. I feel this design and documentation work is as important (and difficult) as implementing the product features and code and the amount of progress it represents should not be underestimated.
I’m new to these ideas so any corrections are more than welcome!
If you have your own summary I would be very interested to read it. Data Chains present complex ideas that benefit from being explained at a few different levels of abstraction.