Update 19 May, 2022

A healthy network is made up of healthy nodes, devices that do what is required of them as expected, within an accepted range of performance. It is the job of the elders to make sure the adults in their section are up to scratch and to take action by voting if they spot anything amiss. Dysfunction tracking is a big part of what the team is working on now.

General progress

This week we realised we were potentially losing nodes during high churn and splits: a node would think it had joined, but with churn the valid section key had moved on. The node, believing everything was fine, would not try to reconnect. So @anselme has been working on verifying our network knowledge and, after updating it, retrying the join process for these lost nodes.
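
As a rough illustration of that check (the types and method names below are stand-ins, not the actual sn_node code), the shape is something like:

```rust
// Illustrative sketch only: these types and methods are stand-ins,
// not the actual sn_node API.
#[derive(Clone, PartialEq, Eq)]
struct SectionKey([u8; 48]);

struct Node {
    known_section_key: SectionKey,
    joined: bool,
}

impl Node {
    /// If the section key has moved on since we thought we joined, our
    /// knowledge is stale: refresh it and go through the join flow again.
    fn verify_and_rejoin(&mut self, latest_key: SectionKey) {
        if self.known_section_key != latest_key {
            self.known_section_key = latest_key;
            self.joined = false;
            self.retry_join();
        }
    }

    fn retry_join(&mut self) {
        // ... send a fresh join request against the updated section key ...
        self.joined = true;
    }
}

fn main() {
    let mut node = Node {
        known_section_key: SectionKey([0u8; 48]),
        joined: true,
    };
    // The section churned past us: the valid key is now different.
    node.verify_and_rejoin(SectionKey([1u8; 48]));
    assert!(node.joined);
}
```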

DBC work continues, with the initial steps for storing the SpentBook in place.

In running testnets to debug data instability we've uncovered a pair of potential deadlocks. In one instance we saw DataStorage reads hanging during replication. So we set about creating some benchmarks to test ChunkStorage, and with this we uncovered a deadlock in the underlying storage code.
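
To give a flavour of the kind of benchmark involved, here's a toy sketch using the criterion crate; ChunkStorage here is a simple in-memory stand-in, not the real storage code:

```rust
// Toy benchmark sketch with the `criterion` crate. `ChunkStorage` here is an
// in-memory stand-in, not the real sn_node storage type.
use std::collections::HashMap;
use std::sync::RwLock;

use criterion::{black_box, criterion_group, criterion_main, Criterion};

/// Minimal chunk store keyed by a simple hash of the contents.
struct ChunkStorage {
    chunks: RwLock<HashMap<u64, Vec<u8>>>,
}

impl ChunkStorage {
    fn new() -> Self {
        Self { chunks: RwLock::new(HashMap::new()) }
    }

    fn store(&self, chunk: &[u8]) -> u64 {
        let addr = chunk
            .iter()
            .fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u64));
        self.chunks.write().unwrap().insert(addr, chunk.to_vec());
        addr
    }

    fn get(&self, addr: &u64) -> Option<Vec<u8>> {
        self.chunks.read().unwrap().get(addr).cloned()
    }
}

fn bench_chunk_write_read(c: &mut Criterion) {
    let storage = ChunkStorage::new();
    let chunk = vec![7u8; 1024 * 1024]; // 1 MiB dummy chunk

    c.bench_function("chunk write + read", |b| {
        b.iter(|| {
            let addr = storage.store(&chunk);
            black_box(storage.get(&addr));
        })
    });
}

criterion_group!(benches, bench_chunk_write_read);
criterion_main!(benches);
```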

Another (unconfirmed) lock was during the PeerLink cleanup process. It's difficult to say if this was definitely happening, but we had seen stalled nodes and our cleanup code had been called frequently around the lockups. Digging in, there was a lot of potential for simplification, so we did just that.

Yet another potential lock was occurring in dysfunction, with the nested data structure incorrectly being read without a mut lock. :open_mouth: :man_facepalming:
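
For anyone curious, this is the general shape of how mixing read and write guards on the same lock can stall a task, illustrated here with a tokio RwLock rather than the actual sn_dysfunction internals:

```rust
// Illustration of the read-then-write lock-up pattern and its fix; not the
// actual sn_dysfunction code. Needs tokio with the "rt" and "macros" features.
use std::collections::BTreeMap;
use tokio::sync::RwLock;

#[tokio::main]
async fn main() {
    let scores: RwLock<BTreeMap<String, u32>> = RwLock::new(BTreeMap::new());

    // The lock-up shape: hold a read guard, then ask for a write guard on the
    // same lock in the same task. The writer waits for every reader to drop,
    // including our own read guard, so the task hangs forever:
    //
    //   let read_guard = scores.read().await;
    //   let current = read_guard.get("node-a").copied().unwrap_or(0);
    //   let mut guard = scores.write().await; // <- never completes
    //
    // The fix: do the whole read-modify-write under a single write guard.
    let mut guard = scores.write().await;
    let new_score = guard.get("node-a").copied().unwrap_or(0) + 1;
    guard.insert("node-a".to_string(), new_score);

    println!("node-a score: {}", new_score);
}
```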

It's difficult to say for certain whether all of these were actually occurring (they require specific conditions to arise), but these changes should certainly be steps in the right direction!

With all of those in, we've also finally integrated the DKG-Handover with Generation work (which was much more stable atop the other recent changes). We're chipping away at stability, with more tests/benchmarks to come for DataStorage, and some other tweaks to the data replication flows being tested (we need to expand the pool of nodes we ask for data: with heavy churn, the odds of hitting another freshly joined node increase, and it looks like we start to lose data retention!).

Dysfunction tracking

Dysfunction tracking is an ongoing effort. There are always new behaviours to test and model, so it's an iterative process, allowing improvements in performance and stability as we move forward.

Dysfunction tracking is different from handling malice. Malice is an objectively provable bad action, such as signing an invalid network message, and it's the job of all nodes to identify such nodes and punish them. Dysfunction, on the other hand, means 'substandard behaviour' and it's the duty of the elders to work out what that means.

A node may be underperforming because of environmental factors such as temporary local internet slowness, or conditions that build up over time such as insufficient storage or memory. Or dysfunction may be sudden, such as a power failure or forced reboot.

Dysfunction covers operational factors too, including the quality and number of messages, and the storing and releasing of data on request.

Some types of dysfunction (data loss, extended connectivity loss) are more serious than others and so should be treated differently.

We can test a node's performance relative to other nodes in the section, but what if the whole section is substandard relative to other sections? What then?

As you can tell, dysfunction tracking is a complex issue with many variables.

Goals of dysfunction tracking

By monitoring nodes as they go about their duties, we want to nip any problems in the bud before they grow. Ejecting a node should be a last resort; instead we want to take action - or have the owner of the node, the farmer, take action - to correct the issue.

This process of progressively correcting node behaviour should be automated and flexible, able to react to changing conditions rather than based on arbitrary hard-coded parameters.

Dysfunction tracking is about optimisation, but it's not just optimising for performance - that would lead to centralisation. Home users would be unable to compete with data centre instances - they would always be relatively dysfunctional.

Where we are

Currently we have a simplified version of dysfunction tracking in place.

Liveness testing checks that nodes are online, enabling elders to take action if they are not. This has been expanded to penalise nodes not only for dropping chunks, but also for dropping connections and being behind in terms of network knowledge.

Liveness testing, once tweaked to ensure we're not being too harsh (as we have been so far) or too soft on misbehaving nodes, is a good first step in ensuring a stable network, but with dysfunction tracking we want to go further and build a model that optimises for other factors too.

Penalising a node currently means dropping it, but penalties will soon incorporate other measures such as halving the node's age. Decisions on what course of action to take will be made by the elders through consensus.

As mentioned, some degree of 'dysfunctional behaviour' is inevitable, and right now we're experimenting with classifying nodes as good (95%+ success ratio for a given operation), mediocre (75%) or bad (30%) so we can treat each class differently according to its seriousness, without hard coding any 'expected' values (PR #1179). We also want to know how many nodes of each class are in our section.
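
Very roughly, the classification boils down to bucketing a node by its success ratio for an operation, something like the sketch below (the thresholds and bucketing are illustrative; the real code in PR #1179 may differ):

```rust
// Rough sketch of the classification idea; thresholds are illustrative only.
#[derive(Debug, PartialEq)]
enum NodeQuality {
    Good,     // roughly 95%+ of operations succeed
    Mediocre, // around the 75% mark
    Bad,      // well below that, e.g. ~30%
}

fn classify(successes: u64, attempts: u64) -> NodeQuality {
    if attempts == 0 {
        return NodeQuality::Good; // nothing to hold against the node yet
    }
    let ratio = successes as f64 / attempts as f64;
    if ratio >= 0.95 {
        NodeQuality::Good
    } else if ratio >= 0.75 {
        NodeQuality::Mediocre
    } else {
        NodeQuality::Bad
    }
}

fn main() {
    assert_eq!(classify(98, 100), NodeQuality::Good);
    assert_eq!(classify(80, 100), NodeQuality::Mediocre);
    assert_eq!(classify(30, 100), NodeQuality::Bad);
    println!("classification sketch ok");
}
```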

The sn_dysfunction crate is here.

Where we're going

We want to expand the number of tests we do and parameters we check in order to ensure we are modelling the network in a meaningful way.

Ideas include regular polling of adults to give them a small proof-of-work which could also check whether they are holding an up-to-date version of some mutable data. Mutable data is a CRDT and will eventually converge, but intermittent connectivity may mean that one replica returns an out-of-date version. How out-of-date it is would have a bearing on its dysfunction score, a cumulative tally accorded to each node. If this exceeds a threshold value, that node may be reclassified as 'mediocre' or 'bad' and be dealt with accordingly. Over what time period we keep adding to the score will need testing.
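
A minimal sketch of that cumulative-score idea (the names, weights and thresholds here are placeholders, not the sn_dysfunction internals):

```rust
// Placeholder sketch of a cumulative dysfunction score with thresholds.
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Classification {
    Good,
    Mediocre,
    Bad,
}

#[derive(Default)]
struct DysfunctionTracker {
    scores: BTreeMap<String, f64>, // node name -> cumulative score
}

impl DysfunctionTracker {
    /// Add to a node's tally, e.g. when a poll returns a stale CRDT replica.
    /// How out-of-date the replica was could weight the penalty.
    fn record_issue(&mut self, node: &str, weight: f64) {
        *self.scores.entry(node.to_string()).or_insert(0.0) += weight;
    }

    /// Reclassify a node once its tally crosses (illustrative) thresholds.
    fn classify(&self, node: &str) -> Classification {
        match self.scores.get(node).copied().unwrap_or(0.0) {
            s if s >= 10.0 => Classification::Bad,
            s if s >= 5.0 => Classification::Mediocre,
            _ => Classification::Good,
        }
    }
}

fn main() {
    let mut tracker = DysfunctionTracker::default();
    tracker.record_issue("node-a", 3.0); // mildly stale data
    tracker.record_issue("node-a", 4.0); // stale again, weighted more heavily
    assert_eq!(tracker.classify("node-a"), Classification::Mediocre);
}
```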

Not all issues can be tracked by comparing performance between peers (if all nodes are bad that means a bad section; if too many sections are bad that means a bad network), so we will look at suitable global parameters.

We also need to consider how we can use dysfunction tracking to encourage node diversity and avoid leading to centralisation, so that both an AWS instance and a Raspberry Pi can play a role.

And we want to consider how we present dysfunction information to the end user, the farmer, possibly as a set of default parameters that are tweakable via a config file, and later via a GUI.

Ultimately, Safe is like a global computer, with the elders being a multithreaded CPU and the adults as a giant hard drive. We want this hard drive to be self-healing, and for the CPU to be able to adapt to changes.

All of this takes us into the realm of chaos engineering, as used by the likes of Netflix and Google on their cloud platforms, to ensure that individual server failures don't bring the whole system down.

There is likely something we can learn here too as we work to make the Safe Network robust and reliable.


Useful Links

Feel free to reply below with links to translations of this dev update and moderators will add them here:

:russia: Russian ; :germany: German ; :spain: Spanish ; :france: French; :bulgaria: Bulgarian

As an open source project, we're always looking for feedback, comments and community contributions - so don't be shy, join in and let's create the Safe Network together!

69 Likes

One two three first

Second week in a row you can see I'm just stalking this thread :slight_smile:

Well done to all the team for all the hard work!!

17 Likes

Yeah, Silver!

12 Likes

Ultimately, Safe is like a global computer, with the elders being a multithreaded CPU and the adults as a giant hard drive. We want this hard drive to be self-healing, and for the CPU to be able to adapt to changes.

Yes. Yes please. That's what we need!
Keep pushing guys! We'll wait, not far behind :stuck_out_tongue:

21 Likes

On that topic, I was wondering if there is something planned for expected outages, e.g. reboot for updates, change of hardware, repairs etc.
Could a node in such a case tell the elders "I'm off but I'll be back shortly" and work off queued-up requests before doing so?
The benefit there being that no penalties would occur to the node and data doesn't have to be unnecessarily replicated if it really comes back soon enough.
I'm not sure if that may create an area of attack though.

13 Likes

Great update!

The necessary complexity of SN is becoming more and more obvious. Dealing with the realities of highly variable hardware, as well as potential attack vectors is an incredibly hard problem to solve.

Much respect to the team for driving this train forward.

Small steps eventually cover mountains though and your patience will surely pay off in time.

Cheers Maidsafe team!

26 Likes

Love this term, and no better team in the world to give order to chaos!

8 Likes

Thanks so much to the entire Maidsafe team for all of your hard work! :racehorse:

9 Likes

Nice. Keep up the good work 👍

10 Likes

Let's gooo! :tada: :100:

8 Likes

Dysfunction/malice handling is an interesting aspect to me because the tracking itself needs to be tracked recursively, to make sure nodes aren't being too lenient or maliciously lenient.

It seems like there are 2 axes here: the intention axis (how likely is it to be intentional), and the severity axis (how dangerous is it). Surely nothing can be truly provably intentionally malicious? Like, a node receives an invalid message, calculates that it's invalid, the bit in memory that marks the validity as False gets hit by a cosmic ray and flipped to True, and the node then signs the message thinking it's just determined it's valid.

This is why I think there should be a way to set the penalty for every offence in a way that can be changed. What works one year for optimising might need to be tweaked for the next year in an update. So, some function for each version of the network that takes as arguments the offence type and any related arguments, and returns a penalty.
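
Roughly what I have in mind, with made-up offences and numbers just to illustrate:

```rust
// Sketch of a per-version penalty function keyed on the offence type, so
// tuning becomes an update rather than a hard-coded constant. All names and
// numbers here are made up.
enum Offence {
    DroppedChunk { count: u32 },
    StaleKnowledge { generations_behind: u32 },
    InvalidSignature,
}

/// Penalty in whatever unit the network settles on (e.g. node-age points).
fn penalty_v1(offence: &Offence) -> u32 {
    match offence {
        Offence::DroppedChunk { count } => 1 + *count / 10,
        Offence::StaleKnowledge { generations_behind } => *generations_behind,
        Offence::InvalidSignature => 20, // provable malice, weighted heavily
    }
}

fn main() {
    println!("{}", penalty_v1(&Offence::DroppedChunk { count: 25 }));               // 3
    println!("{}", penalty_v1(&Offence::StaleKnowledge { generations_behind: 2 })); // 2
    println!("{}", penalty_v1(&Offence::InvalidSignature));                         // 20
}
```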

Maybe with every penalty and pardoned dysfunction there should be a message sent to the dysfunctional node explaining exactly what they are deemed to have done wrong. Could be good for in-house troubleshooting too.

As I understand it, the node age increments for every 2^node_age churn events, so node_age ≈ log2(churn_age). This means that node age-halving roughly square-roots the progress the node has made so far. So, if a node churns 1,000,000 times, its node age will be like 20, and if it gets halved, then it will be back at the same level a node which has churned 1000 times would be at, wiping out 99.9% of its progress.
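
To sanity-check that arithmetic (assuming node age ≈ log2 of total churn events):

```rust
// Quick check of the numbers above, under the assumption that
// node age ~ log2(total churn events).
fn main() {
    let churns: f64 = 1_000_000.0;
    let age = churns.log2();                        // ~19.9
    let halved_age = age / 2.0;                     // ~10.0
    let equivalent_churns = 2f64.powf(halved_age);  // ~1000 churns' worth of progress
    println!(
        "age {:.1} -> halved {:.1} -> like a node with only {:.0} churns",
        age, halved_age, equivalent_churns
    );
}
```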

Maybe it would be better to subtract from the node age rather than dividing it; that way a somewhat fixed proportion of the progress would be reversed. Also, having the node age as a power of 2 of the churn age probably makes it a tiny bit quicker to calculate, but aligning it to a base of like 1.1 instead of 2 would allow for more fine-tuning of penalties. Also, people keeping track of their node ages could receive more updates on their progress. My last suggestion is that the term 'node age' be replaced by 'node level' or 'node qualification' to avoid users being confused and wondering why their nodes don't age in linear time.

This would rely on network events, which in turn rely on user interactions like GETs and PUTs, right? I'm just thinking, if the network stops being used (maybe humanity is destroyed and the network becomes the main archive of our knowledge), then surely all polls of adults would stop and then data would slowly corrupt silently?

Still interested to see whether the network will branch into 2 networks, one being a socialist enough-for-everyone branch, and the other being a capitalist maximum performance branch. These configuration settings would surely make it easier to branch without duplicating code/effort.

16 Likes

I support the big goal of SN. I feel the sincerity coming through SN's report every week, and I'm always grateful for the new information. Thx @maidsafe team

13 Likes

Very interesting thoughts @to7m, thanks for sharing.

10 Likes

Thx Maidsafe devs 4 all your hard work.

Hopefully all these write-ups one day become 'Mastering SAFE Network'. It's unbelievable how much care goes into getting this puppy running.

Please do keep hacking super ants and community members setting up testnets :clap::clap::clap::crazy_face:

11 Likes

Thanks for the hard work! We are getting closer to launch day by day!

10 Likes

Sorry just saw this.

Provably intentionally bad vs effectively bad? Signing bogus data and claiming one thing is something we can log. If that's happening often we can probably say with some certainty there's malice? But that'll come down to…

Yeh I think down the line it'll be configurable (I expect most of a node's setup will be). Effectively allowing decentralized determination of dysfunction.

10 Likes

Surely if a node can be configured to set different penalties, then they could just opt not to set any penalty based on whether they like that particular node or dislike it? Like, I assumed the punishment for a given offence would be a constant for the network or possibly for the section, otherwise how would a node verify that another node is issuing fair punishments?

They'd have to be behaving in line with what other nodes expect in order not to be seen as dysfunctional themselves.

Ideally there'd be tolerance there to allow for various viewpoints, as not all nodes will see all things the same, regardless of config.

But in the end, we cannot control how/when nodes send out messages regarding dysfunction/malice. I mean: there's nothing that forces anyone to use our node implementation vs any other.

So it all centers on nodes agreeing together if something is seeming out of line. If that makes sense?

2 Likes

This is an intriguing proposition. Makes me think of evolutionary software, so now I'm wondering how these features could be evolved. Node age stands out as a proxy for fitness, so perhaps the ability for nodes to "pass on" characteristics (eg dysfunction settings) to new nodes as a function of age.

9 Likes