On the back of last week’s interesting testnet, which wasn’t functional for as long as others but nonetheless stayed up and produced some useful findings, we’ve decided to do a little explainer on nodes joining. This is something we are trying to make more reliable and easier for folks to try out. So if you notice some weird goings-on in your logs, that’ll be why.
Big shout out to @josh for his Gooey app, a friendly GUI to help overcome FCL (fear of the command line). Using Gooey you can PUT and GET files to your heart’s content, without feeling like you’re entering the Matrix. Initiatives like this, @happybeing’s Vdash monitoring app and @southside’s scripts are what make this such a great community.
General progress
First, some good news on the legals: Safe Network Foundation has been duly registered with the Geneva Commercial Register. @JimCollinson and @andrew.james have been working on this for many months and it represents a major milestone passed. We’ve also submitted a no-action letter to FINMA. This lays out the legal opinion that SNT is a utility token for the purchase of storage and not a security, meaning we can crack on with launching.
We’re still digging into the stable set idea - using elders and the oldest adult nodes as primary data storage, while nodes that are going through the node age process are given less important secondary storage tasks to perform. @davidrusu is running experiments on that, while @anselme has been working through possible security implications.
The stable set idea brings a few interesting changes:
- No DKG (as oldest nodes are instantly elders). This removes a significant network load on elder change.
- No nodes in the stable set relocate, so this is an append-only set.
- Nodes in the stable set have no age marker; their age is their relative location in the set (based on first seen).
- New elders sign old elders in the SAP (Section Authority Provider), which allows us to recover from lost consensus nodes (mass churn handling).
- It’s likely we can handle partitions fairly easily with these changes, although we have no mechanism to allow partitioned networks to reconnect as yet.
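To make the ordering idea concrete, here is a minimal Python sketch of an append-only stable set in which seniority is simply a node's position in the list. It is an illustration only, not the design being tested: the `StableSet` class, its method names and the elder count of 7 are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List

ELDER_COUNT = 7  # assumed value for illustration; not stated in this update


@dataclass
class StableSet:
    """Hypothetical append-only membership list, ordered by first seen."""

    members: List[str] = field(default_factory=list)

    def add(self, node_name: str) -> None:
        # Append-only: existing members are never removed or reordered.
        if node_name not in self.members:
            self.members.append(node_name)

    def seniority(self, node_name: str) -> int:
        # No explicit age marker: index 0 is the oldest member.
        return self.members.index(node_name)

    def elders(self, elder_count: int = ELDER_COUNT) -> List[str]:
        # The oldest members are treated as elders directly, with no DKG round.
        return self.members[:elder_count]


stable = StableSet()
for name in ["node-a", "node-b", "node-c", "node-d"]:
    stable.add(name)

print(stable.elders(elder_count=3))
print(stable.seniority("node-c"))
```

Because nothing is ever removed or reordered, every node that sees the same sequence of additions derives the same elders without any extra negotiation, which is the property the list above relies on.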
@oetyng is deep into the comms between clients and nodes, getting redundant types and patterns cleaned up and trying to straighten things out there.
@anselme continues to refactor DBCs. Yesterday he replaced bls_bulletproofs with the upstream bulletproofs in sn_dbc. This not only makes the code safer as we use audited code, but also improves performance significantly!
- Benchmarking reissue split 1 to 100: 2096 ms → 234 ms
- Running all tests: 160 s → 40 s
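For a sense of scale, the speedup factors implied by those two timings can be worked out with a couple of lines of Python. The `speedup` helper is just a name chosen for this example.

```python
def speedup(before: float, after: float) -> float:
    """Return how many times faster the 'after' timing is."""
    return before / after


# Timings quoted above: reissue benchmark in ms, full test run in seconds.
print(round(speedup(2096, 234), 1))  # 9.0
print(round(speedup(160, 40), 1))    # 4.0
```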
On the node front, @joshuef noticed that when a node rejoins it is given a different name, which requires some extra logic. That’s been fixed and nodes now rejoin with the same name.
@Chriso has succeeded in getting EC2 instances that participate in a testnet configured for telemetry collector and data prepper services, which is needed for load distribution and other tasks. He also has a testnet up and running on AWS which is submitting traces to OpenSearch. Should be a PR incoming.
@Roland has been writing an explainer on telemetry, why it’s needed and how we are using it, which we should be able to share soon.
@qi_ma is debugging node joins and relocation, and @bochacho is chipping away at messages and qp2p.
And @bzee has been stripping out redundant code from the node join process.
Node join
Which brings us neatly to nodes from home. The last testnet was the first time in a while that we’ve enabled nodes joining from home (or cloud VM), and also the first ‘official’ test with smaller nodes (after a comnet saw successes there). Overall we were pleased that many folks managed to join, although PUTs seized up fairly quickly. We’ve identified what was going on here: node relocations were sparser than we’d expected, so our initial network was not “aged” enough to provide a stable startup phase.
Ideally we want everyone to be able to join from anywhere on any device, when the network needs more storage. There have been some issues with timeouts, when a node asks to join then loses connection with the elder, that we’ve mostly fixed now (nodes will receive a response saying that voting is underway, as opposed to no response; nodes also now just keep trying with the same name as opposed to switching names, which has confused the join process previously).
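As a rough illustration of that retry behaviour, the Python sketch below keeps asking to join under the same name until it is approved, treating both a "voting underway" reply and a timeout as reasons to try again. It is not the actual sn_node logic: the response labels, the `join_with_retries` function and the fake elder are all hypothetical names invented for this example.

```python
import time
from typing import Callable, List, Optional, Tuple

# Hypothetical response labels, invented for this sketch.
APPROVED = "approved"
VOTING_UNDERWAY = "voting_underway"


def join_with_retries(
    node_name: str,
    send_join_request: Callable[[str], Optional[str]],
    max_attempts: int = 5,
    delay_secs: float = 1.0,
) -> bool:
    """Retry joining under the SAME name until approved or out of attempts."""
    for _ in range(max_attempts):
        response = send_join_request(node_name)
        if response == APPROVED:
            return True
        # VOTING_UNDERWAY or None (timeout): wait, then retry with the same name.
        time.sleep(delay_secs)
    return False


def make_fake_elder() -> Tuple[Callable[[str], Optional[str]], List[str]]:
    """Stand-in for an elder: times out once, reports voting, then approves."""
    replies = iter([None, VOTING_UNDERWAY, APPROVED])
    seen_names: List[str] = []

    def send(name: str) -> Optional[str]:
        seen_names.append(name)
        return next(replies)

    return send, seen_names


send, seen_names = make_fake_elder()
print(join_with_retries("node-a", send, delay_secs=0.0))
print(seen_names)
```

Keeping the name fixed across attempts means the elders only ever see one candidate for this node, rather than a new identity for every retry.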
Some errors can occur when a node joins just after membership (the makeup of the section) has been shared between elders after a DKG session. When this happens, the new node does not get counted and there is a split view. It’s something @qi_ma is correcting now.
The safe node join flags --skip-auto-port-forwarding and --public-addr have now been deprecated as we have removed the UPnP/IGD port forwarding, which has never been reliable. This has been cleared out to simplify qp2p for now. We may well bring this back down the line, but it’s not a priority at this time. You should now be able to join with just safe node join --network-name, although NAT will no doubt be an issue for some home nodes.
That’s because for nodes in a P2P network to be able to talk, they need to be able to find each other, but with NAT their exact address is hidden by the router which only shows the public IP. NAT was introduced as a hack to get around IPv4 running out of addresses, but unfortunately it has stuck around. Different routers and ISPs have different ways of implementing port forwarding, which not everyone will be confident to try anyway. IPv6 solves the problem, but unfortunately take-up is still quite low. It’s a problem for all decentralised networks, and we continue to look for solutions.
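A quick way to see the problem is to check what kind of address a machine actually has. The Python sketch below uses the standard ipaddress module to tell private (behind-the-router) addresses from globally routable ones. It only classifies the address and says nothing about whether a port is really reachable; the `looks_publicly_routable` helper and the sample addresses are assumptions for illustration.

```python
import ipaddress


def looks_publicly_routable(addr: str) -> bool:
    """True if the address is in globally routable space (not private or shared)."""
    return ipaddress.ip_address(addr).is_global


# Typical home LAN addresses sit in private ranges and are hidden behind the
# router's single public IP; 100.64.0.0/10 is the shared range used by
# carrier-grade NAT, so it is not directly routable either.
for addr in ["192.168.1.20", "10.0.0.5", "100.64.0.1", "8.8.8.8"]:
    print(addr, looks_publicly_routable(addr))
```

A node that only knows its private address cannot hand other nodes anything they can dial, which is why some form of port forwarding or IPv6 is needed.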
We plan to phase out the safe node command in favour of using the sn_node binary directly, as @Chriso suggests in this post. The node command is essentially a thin wrapper around the sn_node binary and doesn’t offer much advantage over just using the node binary directly. Furthermore, it leads to maintenance issues in terms of keeping supported arguments synchronised between the two. The node will most likely get its own install script. The safe node run-baby-fleming command will be retained, but will probably become safe run-baby-fleming.
Useful Links
Feel free to reply below with links to translations of this dev update and moderators will add them here:
Russian; German; Spanish; French; Bulgarian
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!