Step-by-step: the road to Fleming, 5: Network upgrades

Jean-Philippe · April 17, 2019, 2:08pm


Like Network restarts, Network upgrades represent a big topic. The subject is also still very much a work-in-progress as it doesn’t fall within the scope of the upcoming Fleming release. This post will explain how we’ve been exploring the options open to us at this stage in order to ensure that the Fleming work takes into account how we see Network upgrades taking place.

Some context

SAFE strives to provide a reliable infrastructure. Like any long-lived software, SAFE will need to adapt to changes and users’ needs over time.

Smoothly upgrading a simple network can be tricky, but upgrading a peer-to-peer network brings its own unique challenges. We need to provide for upgrades that don’t rely on central authorities, that users can control and that the Network can verify.

In addition, we also want to be able to develop on the Network as soon as possible without any temporary requirement to shut it down to let upgrades take place. This means that we’ll likely start with a minimum viable upgrade feature which will inevitably have a number of limitations. But we’ll make sure we can improve it gradually as we move towards our goal.

Why do we need to address this challenge?

Upgrading software can be disruptive. Think about upgrading your browser. You normally have to restart it - and that’s just for a browser. Updating your browser has no impact on the Internet itself. But if you need to restart a peer in the Network to carry out an upgrade, that will affect other peers as peers provide services to each other.

The Network is designed to handle peers going offline, so this isn’t a problem as such. But we do need to ensure that the Network isn’t designed in such a way that upgrades will be problematic. It’s also important to remember that upgrades may require state to persist during the process (so that a peer can return to its job afterwards).

As a result, we’ve spent time making the requirements as clear as we can at this stage.

What can we expect from a good solution?

At this stage, a good solution will have two key characteristics:

It is as minimal and simple as possible (to speed up initial deployment).
It provides the basis for building the upgrade solution we want in the future.

We assume that any solution will require replacing the binary that a user downloads to access the Network, and will therefore involve some downtime. We also assume that this downtime will be shorter than the period after which the Network decides to remove a peer for non-responsiveness. If that holds, the upgrade will not negatively affect a node’s age.

A proposed solution

Most approaches to upgrading software expect to build upon existing pieces of software. Let’s start first with a clear question: how does an upgrade handle peers that are running different versions of the software? A common approach is to embed the version reference in each message between peers. That means the receiving peer can decide whether it is a version it can accept, or must reject. Our initial thinking suggests that doing this by using a single byte (allowing 255 versions) and then cycling back to 0 would be sufficient with an appropriate mechanism to avoid installing old binaries when cycling again through the short version numbers.
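As a rough illustration of the version byte described above, the sketch below prefixes each message with one byte and compares versions modulo 256 (a byte holds the values 0 to 255), so that a version just past the wrap-around still counts as newer. The function names and the half-cycle comparison window are assumptions made for this example, not part of any SAFE codebase; the post only says that some mechanism is needed to avoid installing old binaries after cycling.

```python
from typing import Tuple

VERSION_MODULUS = 256               # assumption: one byte, values 0..255
HALF_WINDOW = VERSION_MODULUS // 2  # assumption: "newer" means less than half a cycle ahead


def is_newer(candidate: int, current: int) -> bool:
    """Return True if candidate is ahead of current, allowing for wrap-around."""
    distance = (candidate - current) % VERSION_MODULUS
    return 0 < distance < HALF_WINDOW


def encode_message(version: int, payload: bytes) -> bytes:
    """Prefix the payload with a single version byte."""
    return bytes([version % VERSION_MODULUS]) + payload


def decode_message(raw: bytes) -> Tuple[int, bytes]:
    """Split a raw message into (version, payload)."""
    if not raw:
        raise ValueError("message is missing its version byte")
    return raw[0], raw[1:]
```

With this comparison, version 0 is treated as newer than version 255, while version 255 is not treated as newer than version 0, so a binary from the previous cycle would be rejected as long as peers never fall more than half a cycle behind.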

For deeper protocol changes, a peer could choose to accept multiple message versions and treat them appropriately. Once this transition period is complete, the special multi-version handling can then be removed to keep the code clean.
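One way to picture this transition period is a table of parsers keyed by message version, where the old entry is simply deleted once the transition is over. The message layouts below (a version 2 payload that starts with a flags byte) and all names are hypothetical and only serve the example.

```python
from typing import Callable, Dict


class UnsupportedVersion(Exception):
    """Raised when a message version is not accepted by this peer."""


def parse_v1(payload: bytes) -> dict:
    # Hypothetical old layout: the payload is just the body.
    return {"flags": 0, "body": payload.decode("utf-8")}


def parse_v2(payload: bytes) -> dict:
    # Hypothetical new layout: one flags byte followed by the body.
    if not payload:
        raise ValueError("version 2 payload is missing its flags byte")
    return {"flags": payload[0], "body": payload[1:].decode("utf-8")}


# During the transition both versions are accepted; removing the entry
# for version 1 later is all it takes to drop the special handling.
PARSERS: Dict[int, Callable[[bytes], dict]] = {1: parse_v1, 2: parse_v2}


def handle(version: int, payload: bytes) -> dict:
    """Parse a message with the parser registered for its version."""
    parser = PARSERS.get(version)
    if parser is None:
        raise UnsupportedVersion(version)
    return parser(payload)
```

Both parsers return the same shape, so the rest of the peer’s code does not need to know which version a message arrived in.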

Another key question is how to ensure that a peer can continue its work seamlessly after an upgrade. To do this, it’s important that its state persists, and that it can reload it. This won’t just be used for Upgrades - it is also a key feature to enable Network Restarts (where a node needs to come back after an unexpected shutdown).

There are two main areas we likely need to persist:

The messages in transit.
The peer’s keys, chain and PARSEC state (on an ongoing basis to support restart).
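A minimal sketch of persisting and reloading such state is shown below. It assumes the state can be serialised to JSON and written to a local file, which is purely an assumption for illustration; the field names used in the usage note are hypothetical, and real key material would need proper protection rather than a plain file. The point of the sketch is the write-then-rename pattern, which avoids leaving a half-written state file if the process stops mid-save.

```python
import json
import os
import tempfile
from typing import Optional


def save_state(path: str, state: dict) -> None:
    """Write state to a temporary file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)     # atomic swap on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path: str) -> Optional[dict]:
    """Return the saved state, or None if the peer has never saved one."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
```

A peer could call `save_state("node_state.json", {"pending_messages": [...], "chain_length": 3})` before shutting down for an upgrade and `load_state("node_state.json")` when the new binary starts.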

It’s important to remember that we don’t necessarily want every peer to act at the same time. If too many peers leave the Network at the same time, the Network functionality will degrade significantly. In this case, we see two main approaches for upgrades:

A staged, slow upgrade - few peers are unavailable simultaneously so the Network handles it with no disruption.
A very fast upgrade - this would propagate very quickly across the Network with a ripple effect ensuring that no messages or transactions are lost.

Fast Network Upgrade

Out of the two, the very fast upgrade may be significantly simpler to put in place so this may be our initial approach in development. There are some clear limitations with this approach, but it should effectively let us add all the planned feature upgrades to the Network over time.

Once a node has a valid signed upgrade, it would rapidly propagate the upgrade to other nodes by refusing to communicate with older versions and replying with an UpgradeRequired error instead. On receiving this error, a peer would send a ProvideUpgradeBinary request, to which the upgraded node would respond.
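The exchange described above can be simulated in a few lines. Only the names UpgradeRequired and ProvideUpgradeBinary come from the post; the `Node` and `Upgrade` classes, the reply strings other than UpgradeRequired, and the `verify` callback standing in for signature verification are assumptions for this sketch. Versions are compared directly, ignoring wrap-around, and only the case of an older node contacting an upgraded one is modelled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Upgrade:
    version: int
    binary: bytes
    signature: bytes


class Node:
    def __init__(self, version: int, verify: Callable[[Upgrade], bool],
                 upgrade: Optional[Upgrade] = None) -> None:
        self.version = version
        self.verify = verify    # placeholder for a real signature check
        self.upgrade = upgrade  # the signed upgrade this node can hand out

    def on_message(self, sender_version: int) -> str:
        """Refuse to talk to older versions."""
        if sender_version < self.version:
            return "UpgradeRequired"
        return "Ok"

    def on_provide_upgrade_binary(self) -> Optional[Upgrade]:
        """Answer a ProvideUpgradeBinary request."""
        return self.upgrade

    def contact(self, peer: "Node") -> str:
        """Send a message to peer and upgrade if told to."""
        reply = peer.on_message(self.version)
        if reply != "UpgradeRequired":
            return reply
        offered = peer.on_provide_upgrade_binary()
        if offered is None or not self.verify(offered):
            return "UpgradeRejected"
        self.version = offered.version
        self.upgrade = offered  # this node can now pass the upgrade on
        return "Upgraded"
```

Because an upgraded node keeps the signed upgrade it received, any older node that contacts it next is upgraded in turn, which gives the ripple effect.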

A key benefit is that this allows faster development, as nodes would then only need to talk to other nodes running their current version. In some cases, fast upgrades can kill a network, so it’s important that we design this appropriately to ensure that it allows recovery of all the data and any restarts before any timeouts create a sudden collapse.

Slower Network Upgrades

Another alternative is the slow upgrade. This would allow a node to be voted into an UpgradingState, during which it stops having responsibilities. Once it has finished providing services, the peer could upgrade without disruption to other nodes. It could then rejoin and be provided with the information that it was holding before the process started. This provides a different set of trade-offs. We may not need to persist as much data, but we would need at least two consecutive peer versions to work effectively together. This is a challenging proposal in itself.
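To make the slow path a little more concrete, here is a toy model of a section voting a node into an upgrading state, parking what it was holding, and handing it back on rejoin. Everything here is an assumption for illustration: the simple-majority quorum, the `Section` class, and the idea that the section itself parks the data (in practice other nodes would take over those responsibilities).

```python
from enum import Enum
from typing import Dict, Iterable, List


class NodeState(Enum):
    ACTIVE = "active"
    UPGRADING = "upgrading"


class Section:
    def __init__(self, members: Iterable[str]) -> None:
        self.state: Dict[str, NodeState] = {m: NodeState.ACTIVE for m in members}
        self.held: Dict[str, List[str]] = {m: [] for m in self.state}
        self.parked: Dict[str, List[str]] = {}

    def vote_upgrading(self, node: str, voters: Iterable[str]) -> bool:
        """Move node to UPGRADING if a majority of section members voted for it."""
        if self.state.get(node) is not NodeState.ACTIVE:
            return False
        quorum = len(self.state) // 2 + 1            # assumption: simple majority
        valid_votes = set(voters) & set(self.state)  # ignore non-members
        if len(valid_votes) < quorum:
            return False
        self.parked[node] = self.held[node]          # node drops its responsibilities
        self.held[node] = []
        self.state[node] = NodeState.UPGRADING
        return True

    def rejoin(self, node: str) -> None:
        """Return the parked information to a node that has finished upgrading."""
        if self.state.get(node) is not NodeState.UPGRADING:
            raise ValueError(f"{node} is not upgrading")
        self.held[node] = self.parked.pop(node)
        self.state[node] = NodeState.ACTIVE
```

While a node is in the upgrading state it runs a different version from its neighbours, which is why this approach needs consecutive versions to interoperate.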

What’s next?

Whilst we’ve not yet finalised our approach to Network Upgrades, the work has definitely highlighted the important design aspects that we have to consider. As we progress with Fleming, the specifics of the Network will become more concrete and settled. At that point, we’ll move on to identifying the actual steps needed to enable upgrades to the Network.

With each of these posts we are always thankful for the community’s feedback and insights. Now that you’ve read through the above, please do feel free to jump onto the thread and share your thoughts on Upgrades. Thanks for taking the time.

Next up we’ll be looking at another aspect of the Network - how messages are routed - and we’ll be comparing our implemented “Disjoint Sections” approach with standard Kademlia.

dirvine · April 17, 2019, 2:27pm

Liking the fast upgrade approach, it is so @maidsafe. We take the thing that brought down Skype, consider it carefully and craft that weakness into a strength by having network nodes designed for such “catastrophic” events. It feels just right, natural and very powerful as an initial mechanism for upgrades. Kudos, routing team.

urrtag · April 17, 2019, 2:41pm

But how do you want to achieve that? The “proposed solution” seems to be pretty centralized (signed binaries; other peers drop the connection if you’re not using a “certified” version; …).

Could this be a platform-independent wasm file that gets executed via e.g. wasmtime?

tobbetj · April 17, 2019, 2:45pm

Just some initial thoughts before the wizards come, except @dirvine, who is already here.
In my mind I see like 5% of a section being allowed to upgrade, then the section verifies the upgrade, then continues to upgrade the next 5% and so on. The nodes with the least amount of responsibility would be upgraded first. When half of a section has been upgraded, it would then speed up and upgrade like 30% of the remaining nodes at a time. It is important that nodes keep a copy of the older software version until it is verified that the new one works; on a critical error they would default back to the older version and then retry upgrading.
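The staged rollout suggested here can be turned into a small batch-size calculation. This is only one reading of the idea: 5% of the whole section per batch until half of the section is upgraded, then 30% of the remaining nodes per batch. The function name, the rounding up and the minimum batch size of one node are assumptions, and the verification step between batches is not modelled.

```python
from typing import List


def ceil_percent(count: int, percent: int) -> int:
    """Integer ceiling of count * percent / 100, avoiding float rounding."""
    return -(-count * percent // 100)


def rollout_batches(section_size: int, slow_percent: int = 5,
                    fast_percent: int = 30) -> List[int]:
    """Return the number of nodes to upgrade in each successive batch."""
    batches: List[int] = []
    upgraded = 0
    while upgraded < section_size:
        remaining = section_size - upgraded
        if upgraded * 2 < section_size:
            # Slow phase: a fixed share of the whole section.
            batch = ceil_percent(section_size, slow_percent)
        else:
            # Fast phase: a share of the nodes still to be upgraded.
            batch = ceil_percent(remaining, fast_percent)
        batch = max(1, min(batch, remaining))
        batches.append(batch)
        upgraded += batch
    return batches
```

For a section of 100 nodes this gives ten batches of 5, followed by shrinking batches starting at 15 until every node is upgraded.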

pierrechevalier83 · April 17, 2019, 2:47pm

But how do you want to achieve that? The “proposed solution” seems to be pretty centralized (signed binaries; other peers drop the connection if you’re not using a “certified” version; …).

The idea is to start with a solution that works technically without worrying about the ideals, so we can deliver Fleming where upgrades are not the focus: resilient routing with Parsec + Node ageing to enable vaults from home is the focus of this one.

With any mechanism to allow upgrades, we can then go ahead and upgrade to a better upgrade mechanism once we’ve put the appropriate amount of effort into designing one that meets our standards.

So really, this post is more about having a high-level idea of the challenges we’ll be facing and proposing a “good enough for a start” solution.

Mindphreaker · April 17, 2019, 2:52pm

Yeah I think for an initial mechanism that would be sufficient. For post Fleming I think this should include some kind of P2P voting where peers can vote if they would like to accept an upgrade or not.

Mendrit · April 17, 2019, 6:20pm

There was a recent discussion about the same issue here:

https://forum.autonomi.community/t/step-by-step-the-road-to-fleming-1-the-big-questions-safe-fleming-and-beyond/27560/7?u=mendrit

Bogard · April 17, 2019, 7:03pm

This approach makes sense, and I couldn’t agree more.

For the finalized upgrade approach though, I think it’ll have to be the “slow” option. In some situations, slow change is good. But for the short-term Fleming target, as you so well put it, the fast approach is optimal.

Anders · April 17, 2019, 7:41pm

On-the-fly network upgrades seem like a really challenging task to me. How will you ensure decentralization? Who decides what to upgrade and when in the network? Doesn’t that require some form of centralized authority? If not, how do you prevent attackers from upgrading the network with malicious intent?

dirvine · April 17, 2019, 7:47pm

Initially it is centralised, as really only we are coding and providing the binaries. We also sign for security to make sure it is us. That is not decentralised, and there is a lot we can do, like reproducible builds (stuff like musl really helps there, easy change), but that needs a lot more teams/indi devs working on core. I think that will happen naturally, especially with dev rewards.

For now we are pushing for speed, so the network launches, gets devs working on it and more. So initially the network upgrade check is: is the binary signed by MaidSafe? If so, then upgrade.

Later it should not do that, but allow decentralised devs to update work and for that to be agreed, by farmers upgrading (like bitcoin etc.). Even better though would be network health checks on any upgrades. That will require a wee bit of AI (I think) and some move towards formal verification. Use of bulletproofs and snarks for proving correct execution of events and such like also will become prominent.

So yes initially centralised dev (it is) and work towards more indi devs, but with formal proof of the upgrade hopefully made by the network as the ultimate goal, poss intermediary steps like we see in other projects in the middle.

Anders · April 17, 2019, 8:14pm

My guess is that it will be difficult to remove an upgrade function once it has been introduced. I think it’s better to do upgrades of the network up to and including beta versions, and after that the production version should be set in stone. Otherwise big companies and even governments and lots of individuals would think of the network as shaky as it is then susceptible to changes into who knows what. Nothing other than true decentralization and fixed standard will work in the long run.

tjf · April 17, 2019, 8:18pm

I see what you did there😊

dirvine · April 17, 2019, 8:56pm

For every single decentralised project. We are not alone.

Anders · April 17, 2019, 9:16pm

You are alone in the sense of the scope of your project. Bitcoin minus its politburo of control is a decentralized system, but the SAFE network is a much bigger project. Ethereum is only semi-decentralized IMO since they need some authorities to change its specifications to deal with performance issues etc. Not good.

dirvine · April 17, 2019, 9:39pm

I mean no decentralised project of any size is currently decentralised in its upgrade process. We have identified that as an issue with potential solutions, but right now we are pushing ahead with stage 1 of that. So when I say we are not alone I do mean all decentralised projects need to solve many issues, upgrades being one of them. Centralised data structures being another (like the blockchain) as well as a few other niggles. However upgrades are vital as they alone define the future of any codebase and this is where I believe we are not alone in having to solve that. I do feel we are alone in seeing it as a major issue that needs solved properly and to me that means without human involvement in the decision process of what do we upgrade to, not only how do we upgrade (say a security fix).

This is stage 1. Reproducible builds etc. could be stage 2 (still not decentralised though) and for me stage 3 means the network evaluates any upgrade and it alone decides on acceptance of that upgrade.

Some projects try governance, but that always IMO leads to centralised control, so I feel we must go a different path and that path has to be more automatic. That is not simple. Say the network can evaluate an upgrade completely and accept it as improving it. Then it also needs to decide on a new feature: will people like that feature? Do they want it? And so on. These are questions the network actually can answer IMO, but the complexity there will be higher than a human dev can calculate; that is where I think we do have to lean on some AI (neuroevolution or SGD type thing).

So regardless of the scope of a project, if it is to be truly decentralised then I think it ends up in the same place, trust humans and trust them not to collude and control, or don’t. However the problem to be solved is the same, regardless of scope.

Anders · April 17, 2019, 9:48pm

Wow. That’s pretty ambitious. I like it! And I came to think that Maidsafe can keep full control over the SAFE network during the entire Beta phase to ensure stability, which could be several years as the network grows and increasingly becomes used for real projects and with real farmers.

dirvine · April 17, 2019, 9:53pm

Thanks for that 👍 I think for us to declare SAFE a success we need the usual: usage, user uptake, saving data in apparent perpetuity (we need infinity to prove perpetual) etc., but most importantly it is when we, MaidSafe, are not required, and really that has to be a big priority after we prove the rest of the goals of truly decentralised networking. That last part could take a while, but I sincerely hope not. During that phase though, openness at least will help a lot, and that will mean getting indi devs quickly to openly debate all code going into the system.

Mendrit · April 17, 2019, 9:58pm

And BFU will act like political parties? 95% will not understand, so they will trust somebody else who understands well? It is delegated voting then.

Anders · April 17, 2019, 10:20pm

After the beta release not only open source developers will be interested in the details and be able to come up with proposals for upgrades but also big tech companies like Google and Huawei! Because with 5G and IoT the tech companies will need some commonly agreed upon platform to run their solutions on, and the SAFE network is a prime candidate for that.

19eddyjohn75 · April 17, 2019, 11:09pm

Deepmind surely seems super bored, maybe they would be interested in a different game

Another nice read
