Update 10 March, 2022

Right now, joining a test net as a node is like starting the worst job you ever had. That one where you were dropped into the thick of it by a sadistic boss, with no proper training or instructions; where you had to deal with constant demands to deliver this message or do that task; and where at some stage you ran out of time or neglected to smile - at which point you were promptly fired.

While we don’t want malingering nodes screwing up the network, the rules as they are currently implemented amount to ‘one strike and you’re out’, which is why recent playgrounds have been short-lived: every time we hoof out a node, the data it holds needs to be relocated, causing floods of chunks and messages, which lead ultimately to network seizure.

This is not entirely unexpected as many optimisations are yet to be put in place, as we explain below, but there’s no substitute for real-world testing to show where we need to focus - so heartfelt :heart: thanks to everyone who joined in the playgrounds and comnets. It may sometimes feel like we’re going backwards, but fear not! It’s all part of the plan.

General progress

@Jimcollinson, @heather_burns, and @andrew.james have been working through the documentation required by the Swiss authorities in setting up the new foundation there. The good news is, it’s all eminently doable and there are no obvious hurdles, which validates our choice of that country. Our documentation has been submitted for the foundation’s incorporation, and we’ll shortly start work on our registration with the Swiss financial authority.

@Anselme has been working on section handovers and consensus – how we select which adults will be promoted to elders on a split and how we resolve the situation when an adult joining a section is older than the elders. We need to ensure the elders agree on the same set of candidates when handover is performed, so consensus is needed. @Davidrusu has now pretty much completed the consensus algorithm, so we’re ready to integrate it.
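For a flavour of what candidate selection involves, here’s a minimal Rust sketch (illustrative only, not the actual safe_network code; the constant, type and function names are made up) that picks the oldest adults deterministically:

```rust
// Illustrative sketch only: not the actual safe_network handover code.
// Assumes each node has an age, and elders are simply the oldest adults.
const ELDER_COUNT: usize = 7;

#[derive(Clone, Debug)]
struct NodeInfo {
    name: [u8; 32], // position in XOR space
    age: u8,
}

/// Deterministically pick elder candidates from the section members:
/// sort by age (oldest first), tie-break on name, so every elder that
/// computes this over the same membership gets the same answer.
fn select_elder_candidates(mut members: Vec<NodeInfo>) -> Vec<NodeInfo> {
    members.sort_by(|a, b| b.age.cmp(&a.age).then(a.name.cmp(&b.name)));
    members.truncate(ELDER_COUNT);
    members
}
```

Determinism on its own isn’t enough, though: during a split, or when an older adult has just joined, different elders may be working from different membership snapshots, which is exactly why a consensus round over the candidate set is needed.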

As well as finalising the pull model and liveness tests (see below), @Yogesh has set up a local dashboard using ELK and Filebeat so we can analyse the logs more easily. Results so far are good, and he’s now working to make it more robust, and to ensure it captures all the metrics it needs to.

Nodes are here to help – so stop overwhelming them

So why is the network less stable at the moment? The answer is that the measures introduced to test for dysfunctional behaviour are currently all or nothing: previously they were switched off (‘nothing’); now they are fully on (‘all’). We are basically killing off nodes for minor misdemeanours, which means excessive churn and data relocation. So we need to dial back the punishments and introduce other checks.

Right now elders only track a node’s ‘liveness’ around its handling of data. However, this is only one metric for determining whether nodes are behaving in a dysfunctional manner. We also need to manage (and compare) things like connectivity, message quality and the number of messages.

We need to watch connectivity in case nodes are rebooting or upgrading. If they can reboot and still be responsive we should not demote them, but we first need to check that this is the case. Because the network has no concept of time, their responsiveness needs to be measured relative to their neighbours’ activities. Messaging can be monitored in the same way.

Because it’s handy to have all this functionality for checking for malicious or malfunctioning nodes in one place, we are considering a new top-level crate to go in the safe_network repo, replacing the liveness tracking we have now. This new crate will have expanded functionality to allow us to track and manage all kinds of node dysfunction.
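To give a flavour of what that could look like, here’s a bare-bones sketch (the crate doesn’t exist yet; the type and method names below are placeholders, not the real API):

```rust
use std::collections::BTreeMap;

/// Placeholder sketch of a single place to record every kind of node
/// dysfunction. Names and fields are illustrative only.
#[derive(Default, Clone)]
struct NodeIssues {
    connectivity: u32,     // dropped or failed connections
    messaging: u32,        // malformed, missing or excessive messages
    pending_data_ops: u32, // data requests not yet fulfilled
}

#[derive(Default)]
struct DysfunctionTracker {
    issues: BTreeMap<String, NodeIssues>, // keyed by node name
}

impl DysfunctionTracker {
    fn track_connectivity_issue(&mut self, node: &str) {
        self.issues.entry(node.to_string()).or_default().connectivity += 1;
    }

    fn track_messaging_issue(&mut self, node: &str) {
        self.issues.entry(node.to_string()).or_default().messaging += 1;
    }

    fn track_pending_data_op(&mut self, node: &str) {
        self.issues.entry(node.to_string()).or_default().pending_data_ops += 1;
    }
}
```

With all the signals recorded in one place, the decision about what actually counts as dysfunctional (and what the penalty should be) can then be tuned without touching the rest of the node code.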

Data replication

Data replication is another major factor in a smooth-running, stable network.

When a node goes offline, we need to transfer its data to other nodes. In the past, data was pushed from one node to another, controlled by elders.

We now have adult-to-adult messaging for data replication, whereby if an adult goes offline or a new adult joins, all the other adults know about it. Knowing the current membership, every adult can calculate what data it should be holding and what data needs to be redistributed/replicated to the other adults to ensure the network maintains a minimum number of copies of each chunk.
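As a rough illustration of that calculation, here’s a minimal sketch assuming the usual closest-by-XOR-distance rule and a fixed copy count of 4 (the function names are made up, not the real API):

```rust
/// Illustrative only: given the current adult membership, find the
/// COPY_COUNT adults closest (by XOR distance) to a chunk's name.
const COPY_COUNT: usize = 4;

type Name = [u8; 32];

fn xor_distance(a: &Name, b: &Name) -> [u8; 32] {
    let mut d = [0u8; 32];
    for i in 0..32 {
        d[i] = a[i] ^ b[i];
    }
    d
}

fn holders_for_chunk(chunk: &Name, mut adults: Vec<Name>) -> Vec<Name> {
    adults.sort_by_key(|a| xor_distance(a, chunk));
    adults.truncate(COPY_COUNT);
    adults
}

/// An adult should hold the chunk iff it is among the closest COPY_COUNT.
/// Since every adult sees the same membership, they all agree on this.
fn should_hold(me: &Name, chunk: &Name, adults: Vec<Name>) -> bool {
    holders_for_chunk(chunk, adults).contains(me)
}
```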

This works in theory, but the playgrounds and comnets have demonstrated a few practical shortcomings, including replication messages failing to reach their targets and being lost, malicious nodes deliberately dropping messages, and bursts of data and messages when many adults notice a node going offline at the same time – which again leads to messages getting dropped.

The new approach, which @yogesh has been working on, aims to solve these limitations by implementing a pull model. Whenever there is a change in the set of adults, nodes will notify each other of what data they should be holding. The receiving nodes will then be responsible for pulling this data from any one of the existing nodes that hold it.

This makes sure that adults only pull data they are meant to hold, and that they are responsible for working this out. If they already have the data, the flow stops at the notification round. Data is sent only when replication is required - instead of the current fire-and-forget messaging, which takes the network’s capacity for granted.
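From a single adult’s point of view, the flow might look something like this (an illustrative sketch only; the function name is hypothetical):

```rust
use std::collections::BTreeSet;

type ChunkName = [u8; 32];

/// Sketch of the pull model: on a membership change, work out what we
/// should now hold and return only the chunk names we still need to pull.
fn chunks_to_pull(
    should_hold: BTreeSet<ChunkName>, // derived from the new adult set
    already_held: &BTreeSet<ChunkName>,
) -> Vec<ChunkName> {
    // If we already have everything, the flow stops here: no data moves.
    should_hold
        .into_iter()
        .filter(|c| !already_held.contains(c))
        .collect()
}
```

Each name returned would then go into a pull request to any one of the current holders, so data is only transferred when a copy is genuinely missing.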

We will also batch the data to reduce the number of messages required. Our playground testnets proved that one message per chunk was not efficient: nodes were going :boom: when there was a lot of data to be replicated.
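Batching could be as simple as something like this (sketch only; the batch size here is arbitrary):

```rust
type ChunkName = [u8; 32];

/// Sketch only: group the chunk names to pull into fixed-size batches
/// instead of sending one message per chunk.
const BATCH_SIZE: usize = 50;

fn batch_pull_requests(missing: Vec<ChunkName>) -> Vec<Vec<ChunkName>> {
    missing.chunks(BATCH_SIZE).map(|b| b.to_vec()).collect()
}
```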

Once this is fully in place, we’ll fire up a new playground to test it out.


Useful Links

Feel free to reply below with links to translations of this dev update and moderators will add them here:

:russia: Russian ; :germany: German ; :spain: Spanish ; :france: French; :bulgaria: Bulgarian

As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!

64 Likes

First? Hardly ever happens!

Lookin’ good. I’ve wondered about node data replication before. From my amateur armchair, it looks like things are moving in a good direction.

Great work Maidsafe team … will reread tomorrow.

Now for sleep! :wink:

24 Likes

Thanks so much to the entire Maidsafe team for all of your hard work! :racehorse:

14 Likes

Fabulous! This really gives even more insight into the process of finding and handling the difficulties, practical considerations, etc., as things develop.

I really like the model shift of making the new Adult responsible for ensuring replication. Elegant.

18 Likes

Huh, hard times? Keep going well MaidSafe!

7 Likes

Excellent!.. Thx…
The current membership seems to be the set of nodes that have to hold the same chunks, i.e. the same data replication. Right?

9 Likes

Remember back when sacrificial chunks were a thing?

Well, what about having more copies than needed - say, twice as many?

These extra copies are sacrificial in the sense that they can be copied to another adult when the chunk space is needed. This would be a slow process, not a flood.

Now when a node goes offline and its chunks need to be written to other nodes, that can also be a much slower process, since there are extra nodes that already have those chunks. We only need to do it fast if the number of nodes holding the chunks drops to the minimum number required (4 at the moment).

  • If the required number of nodes holding a chunk is 4, then have 8 as the preferred number: 4 required and 4 sacrificial.
  • If one of the sacrificial nodes needs space (free space drops below the minimum desired free space) then it relocates a sacrificial chunk to another node, where it is again a sacrificial chunk.
  • If one of the required nodes goes offline then the chunks it held are scheduled to be copied to other nodes, to keep the required/preferred copy counts for each chunk.

This means there is, in general, more time available to restore the copy counts of each chunk. Obviously the lower the count, the more urgent it is to restore the desired copy count for that chunk.

It might be simpler to just treat all copies as sacrificial and require immediate copies to be made for any chunk held on only 4 (or 3) nodes. But I’m not sure if that would cause other issues in the algorithm.
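Very roughly, the idea would be something like this (just an illustrative sketch with the numbers from the example above, not anything the team has committed to):

```rust
/// Rough sketch of the required-vs-sacrificial idea: 4 required copies,
/// 8 preferred. Only dropping below the required count is urgent.
const REQUIRED_COPIES: usize = 4;
const PREFERRED_COPIES: usize = 8;

enum ReplicationAction {
    None,              // at or above the preferred count
    SlowTopUp(usize),  // below preferred: restore copies at low priority
    UrgentCopy(usize), // below required: restore copies immediately
}

fn plan_for_chunk(current_copies: usize) -> ReplicationAction {
    if current_copies < REQUIRED_COPIES {
        ReplicationAction::UrgentCopy(PREFERRED_COPIES - current_copies)
    } else if current_copies < PREFERRED_COPIES {
        ReplicationAction::SlowTopUp(PREFERRED_COPIES - current_copies)
    } else {
        ReplicationAction::None
    }
}
```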

20 Likes

Great work Maidsafe team!

I urge you to consider that node behaviour control requirements will be different at different levels of network maturity.

And what I mean is that a lot of this dialling might not be necessary if we accepted that the network could have a ‘controlled’ start, performed by a number of trusted members that would provide nodes with guaranteed uptime and performance for a certain period of time. It looks to me like this would allow messaging and data replication to ‘settle’ and the network to become mature enough to handle less reliable nodes coming and going.

Not saying this is not useful work; it will surely be essential in the case of large-scale attacks or a network restart. I’m just questioning whether it’s necessary now if we can get a stable network through a controlled start.
Just my .02$

4 Likes

I think we’re likely moving towards this, aye. First thing though is to ensure the “standard” replication (if you like) is all going okay.

Right now, e.g., we’ve removed chunk “delete” so nodes won’t be cleaning up for now, and more nodes will hold any chunk (if they ever held it).

We’ll be looking at mechanisms to easily query more nodes than “the usual suspects” (those 4 that should hold a chunk) and expand that from there (I think).


We actually have some of this in place for initial sections. So increasing nodes per section would expand this easily enough.

What we’re seeing here though is just us not taking an average of a node’s behaviour, so every little thing is deemed worthy of killing a node. Which is obviously too much and causes too much stress. If even our trusted nodes are having this issue, then our baseline is wrong.

So you could look at the dysfunctional detection work as aligning that baseline for well-behaving nodes (“trusted” or not). And then beyond that we should have more robust dysfunction detection.
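In other words, something along these lines (a sketch only, with an arbitrary multiplier), rather than flagging a node on every single issue:

```rust
/// Sketch of "aligning the baseline": compare each node's issue count
/// with the section average and only flag the clear outliers.
fn outliers(issue_counts: &[(String, u32)]) -> Vec<String> {
    if issue_counts.is_empty() {
        return vec![];
    }
    let mean =
        issue_counts.iter().map(|(_, n)| *n as f64).sum::<f64>() / issue_counts.len() as f64;
    issue_counts
        .iter()
        .filter(|(_, n)| *n as f64 > mean * 3.0) // well beyond the baseline
        .map(|(name, _)| name.clone())
        .collect()
}
```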

19 Likes

Could the minimum number of 4 or 8 replicas be too constraining here? Would higher replication counts make the decision-making easier?

For example, what if all adults store 1/2 the data in a section (the closest 1/2 in XOR distance)? That way, when the section splits, you know that there are plenty of copies of data and nothing will go missing. At the same time the replication count is capped to no more than N/2 adults per section. KISS.

5 Likes

Huhah! I can feel the launch of the network!

5 Likes

Thx Maidsafe devs for the update

Really love reading about how the SAFE Network works (it pauses my cluelessness and I can try to explain to others how it works :stuck_out_tongue_winking_eye:)

Please keep experimenting/improving this baby

Keep hacking super ants

12 Likes

This is incredible and solves so many problems.

A. Use cases it’s reportedly being applied to today:

  • Membership and new joins to the network
  • Elder promotions

B. Use cases it’ll likely be applied to in the future:

  • Upgrades
  • Quorum-based compute
  • Quorum-based oracles

11 Likes

Hi team,

Is the roadmap up to date?

Thanks

4 Likes

I think it is, but I think that some of the parts marked as complete on the roadmap are still being modified/adapted/debugged as the network itself comes together.

There are also some things not mentioned on the roadmap like the DBC development … so IMO, the map could use some Maid team love.

4 Likes

Thank you for the heavy work, team MaidSafe! I’ve added the translations to the first post :dragon:


Privacy. Security. Freedom

9 Likes

Just saw this in another thread, so I’d say no.

1 Like