As most here will doubtless know, adults are nodes that store data and serve it on demand. But what if they start acting childish, refusing to store or give up data, or doing so more slowly than expected? For the sake of the network we need to demote or eject such wayward nodes, but before we do so we must redistribute the data they’re holding. We also need to provide meaningful error messages to clients and other nodes when an attempt to store data fails. That’s what we’re delving into this week.
General progress
@bochaco has been working on the safe shell. If you type safe into the console (once safe_network is installed, of course), you enter the shell, meaning you don’t need to type safe before every command thereafter. With all the recent CLI updates this aspect has been a bit left behind, so he’s been putting that right, as well as attending to some tidying and refactoring in the node code in preparation for the upcoming membership changes.
In the DBC labs, @danda is working on integrating Ring CT into the network, including making DBCs more user-friendly for use with the two kinds of keys: a long-lived base owner key for interacting with third parties, such as for donations, and a derived one-time-use key for interacting with mints and the spentbook. He’s also working on test features that can be turned on or off for debugging and optimisation, and has reduced the number of calls required to iterate over the spentbook.
On data replication duties, @yogesh has made progress on a pull model where adults will be told what data they should be holding and will start to pull data from the network automatically to ensure the right number of copies are held for redundancy. More on that below.
And @joshuef and @Qi_ma have been looking at client connection issues thrown up by the playground and the comnet. We may have squashed one CPU-intensive bug (at least, we can no longer reproduce it at the moment), so we’ll be looking to verify that in an upcoming playground.
Preemptive data replication and adult errors
Properly functioning adults are the backbone of the network, and it’s imperative that, should an adult start to misbehave, it is replaced and the data it holds smoothly relocated. This is called preemptive data replication and is detailed in PR #976.
Liveness checks
Elders need to ensure that adults are performing properly. They perform regular liveness checks in which the performance of a node is compared with that of its 3 nearest neighbours. If the pending operations count at a node is 5 times higher than at its neighbours, it will be demoted and its data redistributed. To prepare for this eventuality, preemptive replication starts once the node’s pending ops count is 2.5 times higher than its neighbours’ (these parameters will be optimised during testing). Elders currently initiate this replication.
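To make the two thresholds concrete, here is a minimal sketch of how such a check could be expressed. It is not the code from PR #976: the names LivenessAction and assess are hypothetical, and it assumes that "its neighbours" means the mean pending ops count of the 3 nearest neighbours and that reaching a threshold exactly is enough to trigger it.

```rust
/// What the elders might decide about an adult after a liveness check.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum LivenessAction {
    Healthy,
    StartPreemptiveReplication,
    Demote,
}

// Figures quoted in the update; they are expected to be tuned during testing.
const PREEMPTIVE_REPLICATION_FACTOR: f64 = 2.5;
const DEMOTION_FACTOR: f64 = 5.0;
const NEIGHBOUR_COUNT: usize = 3;

/// Compares a node's pending ops count with that of its nearest neighbours.
///
/// Assumption: the baseline is the mean of the neighbours' counts, floored
/// at 1 so that an idle neighbourhood never causes a division by zero.
fn assess(pending_ops: u64, neighbour_pending_ops: [u64; NEIGHBOUR_COUNT]) -> LivenessAction {
    let sum: u64 = neighbour_pending_ops.iter().sum();
    let baseline = (sum as f64 / NEIGHBOUR_COUNT as f64).max(1.0);
    let ratio = pending_ops as f64 / baseline;

    if ratio >= DEMOTION_FACTOR {
        LivenessAction::Demote
    } else if ratio >= PREEMPTIVE_REPLICATION_FACTOR {
        LivenessAction::StartPreemptiveReplication
    } else {
        LivenessAction::Healthy
    }
}
```

With neighbours each holding 10 pending ops, a node with 30 would fall into the preemptive replication band and a node with 50 would be demoted, under the assumptions above.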
When there is churn in a section (nodes leaving and joining) we need to make sure that data is replicated and distributed to the newly promoted nodes. When an adult is full, it also needs to tell the elders to store the chunk at another adult.
All of this requires some self-awareness by the adult node as to how full it is. Checking space is quite resource-intensive, so we only do it in steps of approximately 10% of the available space.
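One possible reading of the 10% steps, sketched below, is that the node keeps a cheap running tally of bytes written and only performs the expensive space check when that tally crosses into a new 10% band. SpaceCheckScheduler and record_write are hypothetical names, and the real node code may well do this differently.

```rust
/// Size of one step as a percentage of capacity ("approximately 10%").
const STEP_PERCENT: u64 = 10;

/// Hypothetical helper that decides when a real space check is due.
struct SpaceCheckScheduler {
    capacity_bytes: u64,
    written_bytes: u64,
    last_checked_step: u64,
}

impl SpaceCheckScheduler {
    fn new(capacity_bytes: u64) -> Self {
        SpaceCheckScheduler {
            capacity_bytes,
            written_bytes: 0,
            last_checked_step: 0,
        }
    }

    /// Records a write and returns true when the running total has crossed
    /// into a higher step, meaning the expensive check should now be run.
    fn record_write(&mut self, bytes: u64) -> bool {
        self.written_bytes = self.written_bytes.saturating_add(bytes);
        if self.capacity_bytes == 0 {
            return false;
        }
        // Widen to u128 so the multiplication cannot overflow.
        let percent = ((self.written_bytes as u128) * 100 / (self.capacity_bytes as u128)) as u64;
        let step = percent / STEP_PERCENT;
        if step > self.last_checked_step {
            self.last_checked_step = step;
            true
        } else {
            false
        }
    }
}
```

The trade-off in this sketch is that the tally is only an estimate between checks, which is the price of not touching the disk on every write.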
Adult errors
We need to generate errors to advise clients - and the system as a whole - when data is not being stored as it should be. This can happen for a variety of reasons. These errors will be made part of the network protocol with which all nodes must comply if they are to stay in the network.
Below is a list of errors that can arise at an adult node during PUT/GET operations (not counting AE and DKG errors) and the responses we are working on.
CouldNotStoreData - the adult errored during storage, due to the adult’s storage mechanism. This is the adult’s fault. Possible causes are a failure to create directories, problems with the file system or the database used to store registers, corrupted registers, or wrong filepaths.
DataError - the node did not store due to a data error. This is the client’s fault or possibly because the message has been corrupted. Either way (we cannot know) this should be returned to the client.
NodeFull - the node is full! An error message is returned to the elder requesting the storage. We could possibly penalise adults that have not informed us beforehand that their storage levels are getting low.
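The three errors above could be modelled along the following lines. This is only a sketch: the variant names come from the list, but the AdultError and Recipient types, the method names and the message wording are assumptions and not the actual safe_network definitions.

```rust
use std::fmt;

/// The three adult-side errors named in the list (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum AdultError {
    CouldNotStoreData,
    DataError,
    NodeFull,
}

/// Who an error is sent back to (hypothetical type).
#[derive(Debug, PartialEq)]
enum Recipient {
    Client,
    Elder,
}

impl AdultError {
    /// Where the update says the error is returned; None where it does not say.
    fn returned_to(&self) -> Option<Recipient> {
        match self {
            AdultError::CouldNotStoreData => None,
            AdultError::DataError => Some(Recipient::Client),
            AdultError::NodeFull => Some(Recipient::Elder),
        }
    }

    /// True when the update attributes the failure to the adult itself.
    fn is_adult_fault(&self) -> bool {
        matches!(self, AdultError::CouldNotStoreData)
    }
}

impl fmt::Display for AdultError {
    // Illustrative wording only.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let msg = match self {
            AdultError::CouldNotStoreData => "adult could not store data",
            AdultError::DataError => "data was invalid or corrupted",
            AdultError::NodeFull => "node is full",
        };
        write!(f, "{}", msg)
    }
}
```

Keeping the routing decision in one match means that adding a new error forces a compile-time decision about who should hear about it.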
Error spam
As well as informing clients, we can also use these faults as signals that something has gone wrong. At the same time, we need to avoid overwhelming the elders with too much messaging to-and-fro.
In handling these errors we need to ensure we are not opening up new attack vectors, allowing malicious users to knowingly perform illegal operations to DDoS the network by generating masses of error messages. As a future measure, it is possible that we could blacklist clients observed to be behaving in this way.
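Purely as an illustration of that possible future measure, a per-client tally of errors might look like the sketch below. Nothing here reflects an implemented design: ErrorTally, record and the limit of 3 are all hypothetical, and the update names no figure.

```rust
use std::collections::HashMap;

/// Hypothetical limit; the update does not specify one.
const MAX_ERRORS_PER_CLIENT: u32 = 3;

/// Counts client-caused errors so that repeat offenders can be flagged.
struct ErrorTally {
    counts: HashMap<String, u32>,
}

impl ErrorTally {
    fn new() -> Self {
        ErrorTally { counts: HashMap::new() }
    }

    /// Records one error against a client and returns true once that
    /// client has exceeded the limit and could be considered for blacklisting.
    fn record(&mut self, client_id: &str) -> bool {
        let count = self.counts.entry(client_id.to_string()).or_insert(0);
        *count += 1;
        *count > MAX_ERRORS_PER_CLIENT
    }
}
```

A real scheme would presumably also need the counts to expire over time, so that honest clients with the occasional corrupted message are not penalised.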
Useful Links
Feel free to reply below with links to translations of this dev update and moderators will add them here:
Russian; German; Spanish; French; Bulgarian
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!