The Safe Network is built around the concept of keeping the building blocks simple. Individual elements react to external stimuli in limited and predictable ways, yet combine to create an entity that’s capable of performing complex tasks in an unpredictable world and defending itself against enemies - the ant colony analogy.
But for this to work, subtle feedback mechanisms are required. The individual ants need to be able to signal they are under pressure and can’t carry any more, otherwise the system becomes brittle and the colony collapses. @oetyng has been working on a system of message queueing and backpressure, which is a way for nodes to go, ‘Jeez back off will you? I’ll get to you in time but I’ve only got six legs, two mandibles and a tiny brain’. The code’s not complete just yet, but already tests are showing impressive improvements in stability and performance.
General progress
Team MaidSafe are getting to grips with the draft of the UK’s Online Safety Bill, which was released last week, and how it may impact us as a business as well as the Safe Network as a project. Our concerns about the Bill, which has always attempted to regulate the internet around Facebook, certainly have not been assuaged by the new draft; if anything, they’ve been made worse. So we’re working to understand what stances we might need to take, and what discussions we might need to have, to help the government understand that projects like ours aren’t Facebook, nor should we be treated like we are. Fortunately, our policy and governance manager @Heather_Burns has been dealing with this Bill for over three years in her previous jobs and understands it as well as anyone. She’s currently locked in a dark basement with over 500 pages of the Bill’s legal text and a crate of Irn Bru, and will report back soon.
In working through the DBC flows @danda and @davidrusu came to the realisation that the mint, as it was originally specced, was no longer needed, because its functionality - verifying and signing transactions - had now been built into the spentbook. As some sharp-eyed members of the community (hi @happybeing!) spotted, this means that whole swathes of code can be stripped out, leaving less work for both the client and the elders. We’re now debating whether to make the spentbook a separate data type - and perhaps whether to rename it ‘mint’ to fit with convention.
@Chriso is looking into the licensing of the codebase, which has become inconsistent over time. The idea is to license the core network under GPLv3, with non-safe network crates being licensed under MIT/BSD so as not to limit the client apps that can be built on it.
@joshuef has also been working to integrate the dysfunction tracking code, which has exposed a couple of bugs in the node’s query handling. Prior to these fixes, nodes might not have returned a valid chunk to clients if another node responded faster with a fail. They may not have enqueued a peer if one already existed for the same chunk, and we may not have been resending queries out to nodes at all if the original messages were dropped in transit for any reason. The commits fix up that flow and appear to have had a reasonable impact on test results, which is nice.
Backpressure and message queueing
Because of limits in their CPU and memory, nodes cannot handle an infinite number of requests. Thus far, when they have collapsed under the strain we’ve simply killed them off, which results in lots of churn, even more messages flying about, and eventual failure. Backpressure - allowing a node to complain before the crunch point is reached - is a way of smoothing the curve.
So nodes can now push back and say, ‘Hey, I’m going under here, give me a break, only send me 10 messages within the next second’. The network will look proactively at what nodes say they’re capable of doing at a given point in time, and not deluge them with messages if they’re under stress. This will allow them to recover once they’ve finished their task.
Throttling the messages in this way also gives us time to prioritise messages, so if there’s something that is a top priority then that will still go through, whereas less important messages can wait.
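To make the ‘only send me 10 messages within the next second’ idea a bit more concrete, here is a minimal sketch of how a sender might honour a limit reported by a peer. This is not the code from the branch: the `PeerBudget` type, its fields and its method names are all hypothetical, and the fixed one-window counter is just the simplest scheme that illustrates the principle.

```rust
use std::time::{Duration, Instant};

/// Hypothetical sender-side record of what one peer says it can handle.
/// None of these names come from the actual codebase.
struct PeerBudget {
    msgs_per_window: u32, // limit most recently reported by the peer
    window: Duration,     // length of the counting window, e.g. one second
    window_start: Instant,
    sent_in_window: u32,
}

impl PeerBudget {
    fn new(msgs_per_window: u32, window: Duration, now: Instant) -> Self {
        Self {
            msgs_per_window,
            window,
            window_start: now,
            sent_in_window: 0,
        }
    }

    /// Called when the peer pushes back (or recovers) and reports a new limit.
    fn update_limit(&mut self, msgs_per_window: u32) {
        self.msgs_per_window = msgs_per_window;
    }

    /// Returns true if one more message may be sent to the peer right now.
    /// When it returns false the caller keeps the message queued and retries later.
    fn try_send(&mut self, now: Instant) -> bool {
        // Start a fresh window once the current one has elapsed.
        if now.duration_since(self.window_start) >= self.window {
            self.window_start = now;
            self.sent_in_window = 0;
        }
        if self.sent_in_window < self.msgs_per_window {
            self.sent_in_window += 1;
            true
        } else {
            false
        }
    }
}
```

With a limit of 10 and a one-second window, the first 10 calls to `try_send` inside that second succeed and the rest are refused until the window rolls over. A real implementation would probably smooth this out rather than allowing a burst at the start of each window.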
Each node now has a message queue, which holds unsent messages as long as the node is considered live. If not, then the messages are dropped.
With this system all nodes are aware of the number of messages per second the other peers in their section are capable of receiving, as calculated by the back_pressure module, and messages are prioritised so that important infrastructure messages are always sent before less important client service messages.
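The queueing and prioritisation described above could be sketched roughly like this. Again, this is an illustration under assumptions, not the real implementation: `Priority`, `Msg` and `OutboundQueue` are made-up names, there are only two priority levels here, and we assume the ‘node considered live’ check refers to the peer the messages are waiting to be sent to.

```rust
use std::collections::VecDeque;

/// Hypothetical two-level priority, named after the categories in the update.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Priority {
    Infrastructure,
    ClientService,
}

#[derive(Debug, PartialEq, Eq)]
struct Msg {
    priority: Priority,
    payload: String,
}

/// Hypothetical queue of unsent messages for a single peer.
#[derive(Default)]
struct OutboundQueue {
    infrastructure: VecDeque<Msg>,
    client_service: VecDeque<Msg>,
}

impl OutboundQueue {
    /// Queue a message behind others of the same priority.
    fn push(&mut self, msg: Msg) {
        match msg.priority {
            Priority::Infrastructure => self.infrastructure.push_back(msg),
            Priority::ClientService => self.client_service.push_back(msg),
        }
    }

    /// Next message to send: infrastructure first, then client service.
    /// If the peer is no longer considered live, everything queued is dropped.
    fn pop_next(&mut self, peer_is_live: bool) -> Option<Msg> {
        if !peer_is_live {
            self.infrastructure.clear();
            self.client_service.clear();
            return None;
        }
        self.infrastructure
            .pop_front()
            .or_else(|| self.client_service.pop_front())
    }

    /// Total number of messages still waiting.
    fn len(&self) -> usize {
        self.infrastructure.len() + self.client_service.len()
    }
}
```

Keeping one FIFO queue per priority level means messages of the same importance still go out in the order they arrived, while a client service message never jumps ahead of a waiting infrastructure one.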
This is not live yet, but in testing the changes have yielded some really impressive results:
Store and read 5 MB from many clients with 50 client readers gave the following on the test branch:
Time: 30 s
CPU: ~40% (very briefly above 50%)
Mem: ~60 MB / Elder (very briefly up to 125 MB)
compared to the results on main:
Time: 704 s
CPU: 100%, all the time
Mem: ~2 GB / Elder, all the time
All of which means happier and healthier ants.
Useful Links
Feel free to reply below with links to translations of this dev update and moderators will add them here:
Russian; German; Spanish; French; Bulgarian
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!