Bugs can be hard to find, harder to eliminate and sometimes even harder to explain. In these updates we try to lay out the latest news on the progress we’re making and our plans for next steps, but in some ways that’s the easy bit. Like saying we’re making steady progress up a certain creek without saying how far away our destination is, how many paddles we have at our disposal, and ignoring the crocodiles, rapids, and other unpleasantness that lies in the way. The hard bit is explaining those bugs without getting lost in the weeds. It’s a dirty job, but in the interest of providing much-needed context, someone has to sift through the logs. @joshuef drew the short straw.
General progress
The API and CLI code has now been merged into the main Safe Network repo, though there is no new release just yet as there are some failing CLI tests. The release process also needs to be adjusted to take into account the additions to this repo. @Chriso is on the case.
Also tantalisingly close is the removal of the Connection Pool from qp2p, with that functionality taken into Safe Network where we can fine-tune it. The Connection Pool kept client connections open, but in a way that was hard to refine and configure as we want. Removing it simplifies qp2p and removes a lot of edge cases - and almost certainly a lot of bugs.
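For anyone wondering what a connection pool even is, here's a rough sketch of the general shape in Rust. This is not the qp2p or Safe Network code, just an illustration: a map of open connections keyed by peer address, reused until they've sat idle too long. The type names and the timeout policy are made up for the example.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

// Stand-in for a live connection; the real thing would wrap a QUIC connection.
struct Connection;

// The general idea of a connection pool: keep connections open and reuse
// them, dropping any that have been idle for too long. Owning logic like
// this inside Safe Network (rather than inside qp2p) is what lets the
// timeout and reuse policy be tuned to how clients and nodes actually talk.
struct ConnectionPool {
    idle_timeout: Duration,
    conns: HashMap<SocketAddr, (Connection, Instant)>,
}

impl ConnectionPool {
    fn new(idle_timeout: Duration) -> Self {
        Self { idle_timeout, conns: HashMap::new() }
    }

    // Reuse a live connection to `peer` if we have one, otherwise "dial"
    // a new one (stubbed out here).
    fn get_or_connect(&mut self, peer: SocketAddr) -> &Connection {
        let now = Instant::now();
        let timeout = self.idle_timeout;
        // Throw away anything that has sat idle for longer than the timeout.
        self.conns
            .retain(|_, (_, last_used)| now.duration_since(*last_used) <= timeout);
        let (conn, last_used) = self.conns.entry(peer).or_insert_with(|| (Connection, now));
        *last_used = now;
        conn
    }
}

fn main() {
    let mut pool = ConnectionPool::new(Duration::from_secs(30));
    let peer: SocketAddr = "127.0.0.1:12000".parse().unwrap();
    let _conn = pool.get_or_connect(peer); // first call "dials", later calls reuse
}
```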
Meanwhile @Joshuef flushed away a huge blocker this week, managing to reduce message load in some circumstances (between good nodes) from ~65,000 down to ~500, all being well.
@bochaco and @yogesh have been digging into how sections keep a record of each other, how this process can be made more efficient, and where and in what format this information is stored.
And @Lionel.faber has been looking at prioritising message types. Some messages are more important than others. BLS DKG messages, which handle authorisation, should be given top priority. Nothing important should happen without agreement by the elders. Freeing the channels for these messages will speed everything up. At the other end of the spectrum, queries, data commands and error messages can happily wait their turn without affecting performance.
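To make that a bit more concrete, here is a minimal sketch of the idea in Rust. The message kinds and their relative ranking are invented for the example (they are not the real Safe Network message types); the point is simply that a priority queue pops DKG/agreement traffic before queries and data commands, whatever order it arrived in.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Illustrative message kinds only; not the actual Safe Network types.
// Later variants are more urgent.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Priority {
    Query,        // can happily wait its turn
    DataCmd,
    DkgAgreement, // BLS DKG / elder agreement: top priority
}

#[derive(Debug, PartialEq, Eq)]
struct QueuedMsg {
    priority: Priority,
    payload: String,
}

// Order messages by priority so the BinaryHeap pops the most urgent first.
impl Ord for QueuedMsg {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority
            .cmp(&other.priority)
            .then_with(|| self.payload.cmp(&other.payload))
    }
}

impl PartialOrd for QueuedMsg {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    let mut queue = BinaryHeap::new();
    queue.push(QueuedMsg { priority: Priority::Query, payload: "GET chunk".into() });
    queue.push(QueuedMsg { priority: Priority::DkgAgreement, payload: "DKG round".into() });
    queue.push(QueuedMsg { priority: Priority::DataCmd, payload: "PUT chunk".into() });

    // Prints the DKG message first, then the data command, then the query.
    while let Some(msg) = queue.pop() {
        println!("{:?}: {}", msg.priority, msg.payload);
    }
}
```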
Bugs
I don’t think anyone ever claimed Safe to be simple. It’s not. But it’s not not simple either. We have the parts laid out as folk have seen in various testnets. And since the last one (which, we know, feels a while ago), we’ve been hammering away trying to make everything more stable.
The bugs behind the instability are often touched on in updates, but in quite a techy fashion. So here we wanted to give a bit more of a general overview. Something a bit more accessible to folk who don’t like diving about in a text editor for hours at a time.
You have your classic bugs
2+2=5
Or dropped messages between nodes (your post doesn’t arrive).
Or a connection issue, where most of what you want arrives. But the screw you need did not get through. (And now you need to try and make that happen again, so you can see why that screw doesn’t arrive.)
Race conditions, where an issue only arises if one piece of code completes faster than another part of the system. (So perhaps you only see it if your horse LuckyProblems comes in just before OtherwiseWeWork and just after ThisAlreadyHappened; any other combo goes along fine.)
Loops. Things keep happening because the end of one pass triggers the start of the next. Possibly forever. They’ll often cause everything to hang or straight up crash because they keep eating the program’s resources.
Hangs. Also known as deadlocks. These bugs are the Catch-22 of the bug world: you can continue only if you have number=5, but you can only set number if you have number=5. This is obviously a symptom of a classic bug, but it also often walks hand in hand with something racy, so you don’t notice it until it’s too late (and now you aren’t really sure why this would be happening…).
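Here is a tiny, generic Rust illustration of that last one, nothing to do with the actual Safe Network code: two threads each grab one lock and then wait forever for the lock the other thread holds. The sleeps are only there to make the unlucky ordering reliable; without them the same code might run fine most of the time, which is exactly the racy, "why does this only happen sometimes" flavour described above.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let a = Arc::new(Mutex::new(0u32));
    let b = Arc::new(Mutex::new(0u32));

    let (a1, b1) = (Arc::clone(&a), Arc::clone(&b));
    let t1 = thread::spawn(move || {
        let _guard_a = a1.lock().unwrap(); // thread 1 takes lock A first...
        thread::sleep(Duration::from_millis(50)); // ...giving thread 2 time to take B...
        let _guard_b = b1.lock().unwrap(); // ...then waits on B forever
    });

    let (a2, b2) = (Arc::clone(&a), Arc::clone(&b));
    let t2 = thread::spawn(move || {
        let _guard_b = b2.lock().unwrap(); // thread 2 takes lock B first...
        thread::sleep(Duration::from_millis(50));
        let _guard_a = a2.lock().unwrap(); // ...then waits on A forever
    });

    // Each thread now holds the lock the other one needs: neither join ever
    // returns, and the program hangs. That's the whole bug.
    t1.join().unwrap();
    t2.join().unwrap();
}
```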
Then you have some more Safe specifics
Which are often just symptoms of the above…
Message amplification. This is when we might expect to get 5 messages through to our storing nodes, but instead get 500. Which in turn causes another 15,000 to come back. There’s normally a bug in there (2+2=5) when we see this, or it can be that the system isn’t doing what we thought it would, so we need to rethink the design. (We recently had AE-retries naively sent to all elders. To compound this, the next set of retries would therefore be sent from all elders… to all elders.) There’s a toy calculation of this kind of fan-out just after this list.
Sometimes we get a lack of throughput. Messages aren’t dropped, but things are slow. Why!? Sometimes it’s a combination of all of the above.
At the moment, after some refactoring, we have too much throughput. Now this isn’t an issue by itself, but it can often expose various other issues… (take your pick from any of the bug types mentioned in this post!)
Forks! Forks in the path of our section knowledge (who came before us… who begat whom)… If nodes don’t agree for some reason (a buggy reason), well then we can end up with two sets of valid knowledge, but not know which is actually relevant to our current situation.
Data not found is an obvious one… but why? Well, any of the above could lead to the data not actually being PUT in the first place. So good luck finding that which does not exist!
No split! We need splits to keep the network healthy (to split up the workload more easily and maintain resistance to hacks, for example). Not splitting might point to a bug in the DKG algorithm (Distributed Key Generation… or how we give our elders their authority).
Choosing the wrong target. Sometimes the messaging system seems to work and the parcel is delivered. But we’ve actually sent it to the wrong person (or sent it to a whole neighbourhood/section!?).
Being too excited! Sometimes we do something just as soon as we can. But the network, on its necessary route to eventual consistency, isn’t actually ready yet. (Imagine you PUT a chunk, but it hasn’t all been stored yet, and you already try to GET it.) It can seem like there’s a bug. But actually, if you try again in a few seconds, maybe it’s all there and fine. You thought you had a bug, but you were just too keen.
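To put some (entirely made-up) numbers on the message-amplification point from earlier in this list, here is a toy Rust calculation of what happens when every failed attempt is naively retried to every elder rather than to a single target. Seven elders and five initial messages are purely illustrative figures, not the real ones.

```rust
fn main() {
    let elders: u64 = 7;        // illustrative section size, not the real figure
    let mut in_flight: u64 = 5; // the messages we actually meant to send
    let mut total = in_flight;

    // Each naive retry round multiplies the traffic by the number of elders,
    // because every elder resends to every elder.
    for round in 1..=3 {
        in_flight *= elders;
        total += in_flight;
        println!("after retry round {round}: {in_flight} new messages, {total} in total");
    }
    // Three rounds in, 5 intended messages have become 1,715 in flight and
    // 2,000 sent overall: the exponential blow-up. Retrying to a single
    // target instead keeps the same loop linear, a handful of extra
    // messages per round rather than elders^round.
}
```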
SooOOooo
So. That’s a wee rough rundown of various things we can see and come across in the system. That can be per node, per client, or per section… and only sometimes, or only on a Tuesday on an obscure Linux build. And when you see the problem, it may be hiding behind 3 or 4 different bug types before you get to the root of the issue.
All of which we’re looking at in a system of 45 nodes and multiple clients (on average at the moment during internal testing).
Safe isn’t so complex when you think about it, conceptually at least (share data across computers). But it also isn’t as simple as it can be, yet, which is why we’re still chipping away at issues, refactoring things (making them simpler) as well as implementing new features (and sometimes they are aimed squarely at helping to debug).
Removing unnecessary code and complexity helps to get us to something simpler, which, alongside solving your classic bugs in the system, is often one of the most important ways to solve bugs. Less code, less problems.
We are getting there! It doesn’t always feel fast, but it always feels like we are pushing forwards (even when we sometimes need to go backwards a little bit).
Useful Links
Feel free to reply below with links to translations of this dev update and moderators will add them here:
Russian; German; Spanish; French; Bulgarian
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!