Summary
Here are some of the main things to highlight since the last dev update:
- qp2p received some focus this week, with performance improvements made particularly around connection pooling.
- Chunk replication is being reintroduced to the code.
- We are excited to hear about a recent academic paper by the AT2 authors titled Dynamic Byzantine Reliable Broadcast, which seems to provide a formally proven solution for our exact problem.
- In sn_routing we are in the process of adding a join_flag to toggle whether to accept new nodes into the network or not.
Safe Client, Nodes and qp2p
Safe Network Transfers Project Plan
Safe Client Project Plan
Safe Network Node Project Plan
First off, this week in qp2p we implemented connection pooling. This means that if a node wants to connect to a peer and a connection to that peer has been opened before (and is still open), we reuse the existing connection instead of establishing a new one. This improves performance because establishing a connection is expensive (it involves a TLS handshake, among other things). It also improves ergonomics, because users of qp2p no longer need to worry about caching connections themselves. We also implemented connection deduplication, which means that multiple concurrent connection attempts to the same peer all resolve to the same connection instead of each opening a separate one. This further improves performance.
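To illustrate how pooling and deduplication fit together, here is a minimal synchronous sketch using only the Rust standard library (OnceLock needs Rust 1.70+). It is not qp2p's API: qp2p is async and QUIC-based, while Connection, ConnectionPool and the handshake counter here are hypothetical stand-ins, with the counter simulating the cost of a TLS handshake. Each peer gets one OnceLock slot, so concurrent callers block on the same initialisation instead of each dialling the peer, and a cached connection is reused only while it is still open.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for an established qp2p/QUIC connection.
pub struct Connection {
    pub peer: SocketAddr,
    pub id: usize,
    open: AtomicBool,
}

impl Connection {
    pub fn is_open(&self) -> bool {
        self.open.load(Ordering::SeqCst)
    }
    pub fn close(&self) {
        self.open.store(false, Ordering::SeqCst);
    }
}

/// One slot per peer; `OnceLock` lets exactly one caller perform the handshake.
type Slot = Arc<OnceLock<Arc<Connection>>>;

#[derive(Default)]
pub struct ConnectionPool {
    slots: Mutex<HashMap<SocketAddr, Slot>>,
    handshakes: AtomicUsize, // number of (simulated) expensive handshakes performed
}

impl ConnectionPool {
    pub fn connect(&self, peer: SocketAddr) -> Arc<Connection> {
        loop {
            // Fetch or create this peer's slot, holding the map lock only briefly.
            let slot: Slot = {
                let mut slots = self.slots.lock().unwrap_or_else(|p| p.into_inner());
                slots.entry(peer).or_default().clone()
            };
            // Deduplication: concurrent callers block here and share one result.
            let conn = slot
                .get_or_init(|| {
                    let id = self.handshakes.fetch_add(1, Ordering::SeqCst);
                    Arc::new(Connection { peer, id, open: AtomicBool::new(true) })
                })
                .clone();
            // Pooling: reuse the cached connection while it is still open.
            if conn.is_open() {
                return conn;
            }
            // Stale connection: evict the slot unless another caller already replaced it.
            let mut slots = self.slots.lock().unwrap_or_else(|p| p.into_inner());
            if slots.get(&peer).is_some_and(|current| Arc::ptr_eq(current, &slot)) {
                slots.remove(&peer);
            }
        }
    }

    pub fn handshakes(&self) -> usize {
        self.handshakes.load(Ordering::SeqCst)
    }
}
```

Calling connect twice for the same peer performs one handshake and returns the same Arc; once that connection is closed, the next call evicts the stale slot and opens a fresh one.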
We’ve been getting back to chunk replication of Blob data in sn_node. We are starting out with a replication factor of 4, and the network’s adults will be primarily responsible for storing the replicas. We are porting the older implementation from the pre-async vaults to the new codebase, adapting it to the latest changes in storing, querying, etc.
We have also been doing a bit of maintenance all across the board this week by hunting down sneaky unwraps, expects and panics in our production code and tests, essentially stabilising our codebase by handling every error case explicitly. This is best practice and something we’ve been putting off for too long. Look out for additional CI checks being added across our crates in the next few days to ensure these don’t sneak back into our code.
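As a rough illustration of both items, the sketch below picks the holders for a chunk with a replication factor of 4. The selection rule is an assumption (holders are taken to be the adults whose names are XOR-closest to the chunk’s name, with names simplified to u64), and select_holders and ReplicationError are hypothetical names, not sn_node’s code. The function also carries Clippy’s restriction lints unwrap_used, expect_used and panic, one possible way (not necessarily the CI checks we will add) to have cargo clippy reject those calls; a shortage of adults is reported as an error value rather than a panic.

```rust
/// Replication factor mentioned in the update.
pub const REPLICATION_FACTOR: usize = 4;

#[derive(Debug, PartialEq, Eq)]
pub enum ReplicationError {
    NotEnoughAdults { needed: usize, available: usize },
}

/// Hypothetical holder selection: the adults XOR-closest to the chunk name.
/// Under `cargo clippy`, any unwrap/expect/panic in this function is an error.
#[deny(clippy::unwrap_used, clippy::expect_used, clippy::panic)]
pub fn select_holders(chunk: u64, adults: &[u64]) -> Result<Vec<u64>, ReplicationError> {
    if adults.len() < REPLICATION_FACTOR {
        // Report the problem to the caller instead of panicking.
        return Err(ReplicationError::NotEnoughAdults {
            needed: REPLICATION_FACTOR,
            available: adults.len(),
        });
    }
    let mut sorted = adults.to_vec();
    // Distinct names always have distinct XOR distances, so the order is well defined.
    sorted.sort_by_key(|adult| *adult ^ chunk);
    sorted.truncate(REPLICATION_FACTOR);
    Ok(sorted)
}
```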
We invested a bit of time in researching and thinking about how the APIs will eventually need to evolve to support signing client requests with multiple key pairs rather than just one. For example, a client may want to store a file owned by one public key while paying for the operation with a second public key that owns the funds, and perhaps using a third key pair to encrypt the file’s content. This is not a high priority at the moment; it is more of a PoC to help us identify the challenges and work out how our client APIs should eventually evolve.
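The example above can be sketched as a request that names each key by its role. Everything here is hypothetical (PublicKey, StoreRequest and missing_signers are not Safe client types), and signature verification is left out entirely. It also assumes that the owner and the payer must both sign while the encryption key never signs, and that a single signature suffices when one key fills both roles.

```rust
/// Hypothetical 32-byte public key stand-in (not a real Safe Network type).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct PublicKey(pub [u8; 32]);

/// A store request where different keys play different roles.
pub struct StoreRequest {
    pub owner: PublicKey,          // will own the stored file
    pub payer: PublicKey,          // owns the funds paying for the operation
    pub encryption_key: PublicKey, // encrypts the content; assumed not to sign
}

impl StoreRequest {
    /// Keys whose signatures the request needs, sorted and deduplicated.
    pub fn required_signers(&self) -> Vec<PublicKey> {
        let mut signers = vec![self.owner, self.payer];
        signers.sort();
        signers.dedup();
        signers
    }
}

/// Required signers not covered by the keys that actually signed.
pub fn missing_signers(req: &StoreRequest, signed_by: &[PublicKey]) -> Vec<PublicKey> {
    req.required_signers()
        .into_iter()
        .filter(|key| !signed_by.contains(key))
        .collect()
}
```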
CRDTs
Work continues on dynamic membership in DSB, and this week our consultant wrote a test case that demonstrates a concurrency problem with data operations while a member is leaving the group. A correct implementation of dynamic membership should always pass the test, while our current naive implementation fails it, so we now have something concrete to measure against.
To that end, we are excited about a recent academic paper by the AT2 authors titled Dynamic Byzantine Reliable Broadcast. To quote the paper, it provides “the first specification of a dynamic Byzantine reliable broadcast (dbrb) primitive that is amenable to an asynchronous implementation”. In other words, this paper appears to provide a formally proven solution to exactly our problem.
Our consultant is presently reviewing this paper as well as another possible solution using something called a Generation Clock that might not require as much network communication.
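For readers unfamiliar with the term, a generation clock (as the pattern is usually described) is a counter that increases monotonically with every change of leadership or membership, so that messages stamped with an older generation can be recognised as stale. The sketch below is only a generic illustration of that idea applied to membership changes; it is not the design our consultant is evaluating, and Replica, Verdict and the u32 member ids are hypothetical. For simplicity, operations stamped with the current or a newer generation are accepted.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Accept,
    RejectStale { op_generation: u64, current: u64 },
}

/// A group replica whose membership is versioned by a generation clock.
pub struct Replica {
    members: BTreeSet<u32>, // member ids simplified to u32 (hypothetical)
    generation: u64,        // bumped on every actual membership change
}

impl Replica {
    pub fn new(members: impl IntoIterator<Item = u32>) -> Self {
        Replica { members: members.into_iter().collect(), generation: 0 }
    }

    pub fn generation(&self) -> u64 {
        self.generation
    }

    pub fn add_member(&mut self, id: u32) {
        if self.members.insert(id) {
            self.generation += 1;
        }
    }

    pub fn remove_member(&mut self, id: u32) {
        if self.members.remove(&id) {
            self.generation += 1;
        }
    }

    /// Reject data operations issued under an older membership generation.
    pub fn check_op(&self, op_generation: u64) -> Verdict {
        if op_generation < self.generation {
            Verdict::RejectStale { op_generation, current: self.generation }
        } else {
            Verdict::Accept
        }
    }
}
```

An operation prepared before a member left carries the old generation and is flagged as stale instead of being applied against the new membership.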
Routing
As mentioned in last week’s update, the work to allow a node to rejoin with the same name was approved and merged this week. This means any rejoining node is immediately relocated with half its age, as long as the halved age is greater than MIN_AGE (currently 4). This is designed to discourage malicious restarts.
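The age rule can be written down in a few lines. This is a minimal sketch, not sn_routing’s code: rejoin_age is a hypothetical name, ages are assumed to be integers with halving rounding down, and because the update does not say what happens when the halved age would not exceed MIN_AGE, the sketch simply returns None in that case.

```rust
/// MIN_AGE as stated in the update.
pub const MIN_AGE: u8 = 4;

/// Age a rejoining node would be relocated with, or `None` when the
/// halved age would not exceed MIN_AGE (behaviour in that case unspecified).
pub fn rejoin_age(previous_age: u8) -> Option<u8> {
    let halved = previous_age / 2; // integer division rounds down (assumption)
    if halved > MIN_AGE {
        Some(halved)
    } else {
        None
    }
}
```

For example, a node that left at age 32 would rejoin at 16, while one that left at age 9 halves to 4, which is not greater than MIN_AGE.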
We had observed during internal testing that the genesis node was sometimes being demoted too quickly. This was caused by a recent change under which nodes start with ages drawn from a random range during the startup phase. To resolve this, we decided to start the first node with a higher age, currently set to 32, and this change has now been merged to master. It ensures the genesis node stays an elder for a sufficiently long time, which makes testing and testnet setup much easier.
The ongoing work to improve lost peer detection is progressing well. We have already taken advantage of the new connection pooling feature in qp2p, which allowed us to simplify the code. Some initial integration tests show the refactoring works well. This PR is now going through some final review and testing and so will hopefully be merged soon.
To ensure that new nodes flow in to share the workload when a node’s resources are close to being used up, we are going to allow nodes to tell routing whether or not to accept new nodes. Restricting when the network accepts new nodes will also help prevent Sybil attacks, since potential attackers will not be able to add unlimited nodes at will.
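One way such a flag could behave is sketched below. JoinGate, update_from_usage, JoinResponse and the 90% threshold are all hypothetical; the update does not say how nodes will decide when to set the flag, so storage fullness is used here purely as an example of “resources close to being used up”. By default the flag is off, so join requests are rejected until a node reports that it is nearly full.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical policy: start admitting new nodes once 90% of storage is used.
pub const FULLNESS_THRESHOLD_PERCENT: u64 = 90;

#[derive(Debug, PartialEq, Eq)]
pub enum JoinResponse {
    Accepted,
    Rejected,
}

/// Holds the join flag that the node layer toggles and routing consults.
#[derive(Default)]
pub struct JoinGate {
    join_flag: AtomicBool, // false by default: new nodes are not accepted
}

impl JoinGate {
    /// Called by the node layer after checking its own resource usage.
    pub fn update_from_usage(&self, used_bytes: u64, capacity_bytes: u64) {
        let nearly_full = capacity_bytes == 0
            || used_bytes.saturating_mul(100)
                >= capacity_bytes.saturating_mul(FULLNESS_THRESHOLD_PERCENT);
        self.join_flag.store(nearly_full, Ordering::SeqCst);
    }

    /// Routing-side check when a would-be node asks to join.
    pub fn handle_join_request(&self) -> JoinResponse {
        if self.join_flag.load(Ordering::SeqCst) {
            JoinResponse::Accepted
        } else {
            JoinResponse::Rejected
        }
    }
}
```

Keeping the flag closed until capacity is actually needed is what limits an attacker’s ability to flood the network with nodes.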
In addition, we’re making some changes to give us better visibility of routing’s exact status during network startup and beyond. There are two ongoing PRs: “Indication for section start-up and firing PromotedToAdult event” and “notify when key is changed during relocation”. These in-progress changes will help us ensure that routing is behaving as intended, and should be completed soon, once the API designs are agreed between nodes and routing.
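To show the kind of visibility these events give a node, here is a small sketch of a node folding routing events into a status record. Only PromotedToAdult is named in the update; SectionStarted, RelocationKeyChanged, NodeStatus and the u64 key are hypothetical and are not sn_routing’s actual Event type, whose final shape depends on the API design still being agreed.

```rust
/// Hypothetical routing events; only `PromotedToAdult` is named in the update.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RoutingEvent {
    SectionStarted,                        // stand-in for the section start-up indication
    PromotedToAdult,                       // node has been promoted to adult
    RelocationKeyChanged { new_key: u64 }, // key simplified to u64 (hypothetical)
}

/// What the node layer knows about routing's state, built from events.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct NodeStatus {
    pub section_started: bool,
    pub is_adult: bool,
    pub current_key: Option<u64>,
}

impl NodeStatus {
    /// Update the tracked status from a single routing event.
    pub fn apply(&mut self, event: &RoutingEvent) {
        match event {
            RoutingEvent::SectionStarted => self.section_started = true,
            RoutingEvent::PromotedToAdult => self.is_adult = true,
            RoutingEvent::RelocationKeyChanged { new_key } => self.current_key = Some(*new_key),
        }
    }
}
```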
Useful Links
Feel free to reply below with links to translations of this dev update and moderators will add them here:
Bulgarian; Russian; German; Spanish; French
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!