Previous posts in the Road to SAFE Fleming
Like Network restarts, Network upgrades represent a big topic. The subject is also still very much a work-in-progress as it doesn’t fall within the scope of the upcoming Fleming release. This post will explain how we’ve been exploring the options open to us at this stage in order to ensure that the Fleming work takes into account how we see Network upgrades taking place.
Some context
SAFE strives to provide a reliable infrastructure. Like any long-lived software, SAFE will need to adapt to changes and users’ needs over time.
Smoothly upgrading a simple network can be tricky, but upgrading a peer-to-peer network brings its own unique challenges. We need to provide for upgrades that don’t rely on central authorities, can be controlled by the users and verified by the Network.
In addition, we also want to be able to develop on the Network as soon as possible without any temporary requirement to shut it down to let upgrades take place. This means that we’ll likely start with a minimum viable upgrade feature which will inevitably have a number of limitations. But we’ll make sure we can improve it gradually as we move towards our goal.
Why do we need to address this challenge?
Upgrading software can be disruptive. Think about upgrading your browser. You normally have to restart it - and that’s just for a browser. Updating your browser has no impact on the Internet itself. But if you need to restart a peer in the Network to carry out an upgrade, that will affect other peers as peers provide services to each other.
The Network is designed to handle peers going offline, so this isn’t a problem as such. But we do need to ensure that the Network isn’t designed in such a way that upgrades will be problematic. It’s also important to remember that upgrades may require state to persist during the process (so that a peer can return to its job afterwards).
As a result, we’ve spent time understanding the requirements as clearly as possible at this time.
What can we expect from a good solution?
At this stage, a good solution will have two key characteristics:
- It is as minimal and simple as possible (to speed up initial deployment).
- It provides the basis for building the upgrade solution we want in the future.
We assume that any solution will require the binary a user downloads to access the Network to be replaced and therefore some downtime. We also assume that this downtime will be shorter than the time at which the Network makes a decision to remove a peer from the Network for non-responsiveness. Doing this means that the upgrade will not negatively affect a node’s age.
A proposed solution
Most approaches to upgrading software expect to build upon existing pieces of software. Let’s start first with a clear question: how does an upgrade handle peers that are running different versions of the software? A common approach is to embed the version reference in each message between peers. That means the receiving peer can decide whether it is a version it can accept, or must reject. Our initial thinking suggests that doing this by using a single byte (allowing 255 versions) and then cycling back to 0 would be sufficient with an appropriate mechanism to avoid installing old binaries when cycling again through the short version numbers.
For deeper protocol changes, a peer could choose to accept multiple message versions and treat them appropriately. Once this transition period is complete, the special multi-version handling can then be removed to keep the code clean.
Another key question is how to ensure that a peer can continue its work seamlessly after an upgrade. To do this, it’s important that its state persists, and that it can reload it. This won’t just be used for Upgrades - it is also a key feature to enable Network Restarts (where a node needs to come back after an unexpected shutdown).
There are two main areas we likely need to persist:
- The messages in transit.
- Its keys, chain and PARSEC state (on an ongoing basis to support restart).
It’s important to remember that we don’t necessarily want every peer to act at the same time. If too many peers leave the Network at the same time, the Network functionality will degrade significantly. In this case, we see two main approaches for upgrades:
- A staged, slow upgrade - few peers are unavailable simultaneously so the Network handles it with no disruption.
- A very fast upgrade - this would propagate very quickly across the Network with a ripple effect ensuring that no messages or transactions are lost.
Fast Network Upgrade
Out of the two, the very fast upgrade may be significantly simpler to put in place so this may be our initial approach in development. There are some clear limitations with this approach, but it should effectively let us add all the planned feature upgrades to the Network over time.
Once a node has a valid signed upgrade, this process would rapidly propagate upgrades to other nodes by refusing to communicate with older versions with an UpgradeRequired
error. On receiving this error, peers would send a ProvideUpgradeBinary
to which the upgraded node would respond.
A key benefit is that this allows faster development as nodes would then only need to talk to other nodes running their current version. In some cases, fast upgrades can kill a network so it’s important that we design this appropriately in order to ensure that it allows recovery of all the data and any restarts before any timeouts created a sudden collapse.
Slower Network Upgrades
Another alternative is the slow upgrade. This would allow a node to be voted as being in an UpgradingState
state, during which it stops having responsibilities. Once they finished providing services, the peer could upgrade without disruption to other nodes. They could then rejoin and be provided with the information that they were holding before the process started. This provides a different set of trade-offs. We may not need to persist as much data - but we would need at least two consecutive peer versions to work effectively together. This is a challenging proposal in itself.
What’s next?
Whilst we’ve not yet finalised our approach to Network Upgrades, the work has definitely highlighted which are the important design aspects that we have to consider. As we progress with Fleming, the specifics of the Network will become more concrete and settled. At that point, we’ll then move on to identifying the actual steps needed to enable upgrades to the Network.
With each of these posts we are always thankful for the community’s feedback and insights. Now that you’ve read through the above, please do feel free to jump onto the thread and share your thoughts on Upgrades. We’re always hugely grateful to the Community for its input so thanks for taking the time
Next up we’ll be looking at another aspect of the Network - how messages are routed - and we’ll be comparing our implemented “Disjoint Sections” approach with standard Kademlia.