We are taking down TEST4 now as it has served its purpose (we thought it would only last a few days at most, but it’s been more persistent than that).
Findings of TEST4:
Breaking messages up into much smaller chunks to send them has been successful.
Using OS-provided scheduling for socket handling (Mio crate) has driven down resource usage considerably.
Not sending GROUP_SIZE number of messages each time has reduced traffic considerably.
All in all we have reached a point where resource usage is more than acceptable for now. Also nice to see fully static binaries for Linux and ARM making a long overdue appearance.
We still have not focussed on data persistence, but that is about to change.
The next steps
We will now carry out more tests, over today and the weekend till Tuesday when TEST5 will hopefully start. We are looking at testing the following immediately on droplets before releasing to community again:
Swapping the service discovery to not start vaults if it finds another close by. This is the reverse of its purpose, but beneficial for this test. In the last test we still had too many nodes per machine, to the test’s detriment.
Test rewrite of nat_traversal (should be as good as it was, but still needs to go further).
Never dropping messages related to data relocation, even under high load. We need to evaluate whether this will help prevent data loss or hurt the network by increasing traffic.
Re-enabling caching, which has been reimplemented to work with split messages now.
Plus potentially
Re-enable bootstrap-cache. This will allow nodes to find other bootstrap nodes and not require us to keep droplets running. This will also mean community tests can continue almost automatically.
Reduce message chunks even lower than the 20KB we currently have.
Alter group & quorum sizes to confirm resilience
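To make the message chunking mentioned above concrete, here is a minimal sketch of splitting a message into parts of at most 20KB and putting it back together. It is an illustration only, not the routing crate's code: the `Part` struct, the `split_message` and `reassemble` functions and the `MAX_CHUNK_SIZE` constant (taking 20KB as 20 * 1024 bytes) are all assumed names.

```rust
/// Assumed chunk size: the 20KB figure from the update, read as 20 * 1024 bytes.
const MAX_CHUNK_SIZE: usize = 20 * 1024;

/// One part of a split message (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
struct Part {
    index: u32,       // position of this part within the message
    total: u32,       // number of parts the message was split into
    payload: Vec<u8>, // at most `chunk_size` bytes
}

/// Split `message` into parts of at most `chunk_size` bytes each.
fn split_message(message: &[u8], chunk_size: usize) -> Vec<Part> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    // Ceiling division gives the number of parts.
    let total = ((message.len() + chunk_size - 1) / chunk_size) as u32;
    message
        .chunks(chunk_size)
        .enumerate()
        .map(|(i, chunk)| Part {
            index: i as u32,
            total,
            payload: chunk.to_vec(),
        })
        .collect()
}

/// Rebuild the message from parts that may arrive in any order.
/// Returns `None` if a part is missing, duplicated or inconsistent.
fn reassemble(mut parts: Vec<Part>) -> Option<Vec<u8>> {
    let total = parts.first()?.total as usize;
    if parts.len() != total {
        return None;
    }
    parts.sort_by_key(|p| p.index);
    for (i, part) in parts.iter().enumerate() {
        if part.index as usize != i || part.total as usize != total {
            return None;
        }
    }
    Some(parts.into_iter().flat_map(|p| p.payload).collect())
}
```

Calling `split_message(&bytes, MAX_CHUNK_SIZE)` on a 50,000 byte message gives three parts. Lowering the chunk size, as proposed above, only changes the constant. Note that in this sketch an empty message produces no parts, so `reassemble` returns `None` for it.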
Two main issues we want to address fairly urgently:
Small capability nodes. This is when a vault is so poor that it can only damage the network. The network has to cut such nodes (they may fluctuate in capability, so this is non-trivial). We have reduced the minimum requirement, but have not taken any steps to address the root issue of maintaining a minimum level of resource. The network should and will measure and maintain this.
Data retention - We require the network to be able to restart and republish all data. This is linked to the first issue above, as potentially not all nodes need to store all data in the group (they should not have to). There are many advantages to this approach and we will present them here soon; for now though the focus is on data loss protection and security of data.
These may happen during Alpha and possibly not make it into Alpha 1, but this is what the tests are all about. Alpha though will include installers and more. So when we have a “download here” button we will be in Alpha; we will also let everyone know in advance.
Refactors currently happening (to catch up with test findings and fixes)
Routing client to be split from node in core.rs
Move much of the core.rs handling into a peer_manager to better and more simply handle peer related states in routing
safe_core passing back more information to API to allow launcher to display much better user feedback.
Given people’s interest at this stage, it would be useful to give feedback to hosts when their node is not useful… even if that is just an error in the terminal, it would raise awareness of where that bar lies… atm perhaps good intentions from those nodes, or issues with them, are not being addressed because of wishful thinking. Perhaps in future some indication of the estimated reward, or feedback reassuring a new user that their resources are sufficient to make a contribution, will help new users’ confidence that they should provide what they have to the network.
Test4 is the first time I have seen something that could potentially scale up and work as expected. I sincerely hope test5 can provide data retention and only leave optimization and additional features for future iterations.
I would hope that every weekly update would include an official message from the dev team about when to expect the MVP.
I hope you know how lucky you are to have such a dedicated community believing in what you are working towards contrary to all conventional signals.
Yes, 100% agree, this is the measurement we are looking for, when we find it (the algorithm to let the network choose) the node should exit with the correct error message, or similar.
That it can work is not in doubt… the idea is a good one, it’s just that getting there is not trivial. Hence testing is iterative, and it resolves a more robust capability than would happen in a rush.
It would be interesting to get feedback simply suggesting that a node is in the bottom 20% of its peers… challenging all those to consider doing better. Alerting strong nodes that they are the backbone perhaps might also be useful, in the absence of safecoin feedback.
You are completely wrong. It is a stretch and the proof of it actually working in practice is quite far in the future. But we have some indications that it indeed might be possible.
A simple 1-10 scale would be hugely significant feedback for the operator of a node. So everyone can be above average, like the kids at Lake Wobegon.
You’re too much a pessimist. The scope in which it can work is clearly large enough to be very optimistic. It’s just a matter of resolving the limits and making it robust so that it can resist real world normal challenges and then over time rare events.
Will there be a GUI for vaults someday? A simple GUI would solve this issue. No pressure, I know you guys are busy saving the world but a GUI would be nice.
Yes this is one of the things we need to be doing. The issue is what is the mark that we are 20% of if you see what I mean. It may be always the lowest 20% upload speed or similar, it may not be though.
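As a rough illustration of the "lowest 20% upload speed" idea, assuming upload speed really is the chosen mark (which is exactly the open question), a node could compare its own figure with what it knows of its peers. The function name `in_bottom_percent` and the idea that a node holds a list of its peers' speeds are assumptions for this sketch.

```rust
/// Returns true if fewer than `percent` percent of the known peers are
/// slower than this node, i.e. the node sits in the bottom `percent`.
/// `peer_speeds` holds the other nodes' upload speeds, in any consistent unit.
fn in_bottom_percent(own_speed: f64, peer_speeds: &[f64], percent: usize) -> bool {
    if peer_speeds.is_empty() {
        // Nothing to compare against, so make no claim.
        return false;
    }
    let slower = peer_speeds.iter().filter(|&&s| s < own_speed).count();
    // Integer comparison avoids floating point edge cases at the boundary.
    slower * 100 < peer_speeds.len() * percent
}
```

With ten peers, `in_bottom_percent(speed, &peers, 20)` is true only when fewer than two of them are slower than this node. The same count could be mapped onto a 1-10 scale for display to the operator.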
So a node can drop messages if the queue starts to fill up, it gives us indications, but what happens is this is very dynamic. So massive churn and queues fill up. This is where it gets more difficult and has to be somewhat elastic. This is the route though. There is also a case where a node may not work in a particular part of the network but would be OK in others. So forced relocation is also a consideration.
Another (yea, there are lots to consider, sorry) is that a person can simply gobble up memory/bw etc. for a period, but the node is still good.
Timers obviously are bad in such conditions as the times are dynamic as well.
Anyhow this is what we are seeing and poking at right now. We have lowered the bar, but now need to find the drop dead height. If folks then run too many nodes some will crash out and that’s good all round. Not there yet though, still measuring.
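The queue behaviour described in this post, combined with the idea from the update of never dropping data relocation messages, could look roughly like the sketch below. This is not how routing implements it: `MessageQueue`, `Kind`, `soft_limit` and `drop_rate` are made-up names, and a fixed limit is a simplification of something that would have to be elastic.

```rust
use std::collections::VecDeque;

/// Message categories (hypothetical; real routing messages have many more kinds).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    DataRelocation,
    Other,
}

/// A queue that sheds ordinary messages once `soft_limit` is reached,
/// but always accepts data relocation messages.
struct MessageQueue {
    soft_limit: usize,
    queue: VecDeque<(Kind, u64)>, // (kind, message id)
    offered: usize,               // messages seen so far
    dropped: usize,               // messages refused so far
}

impl MessageQueue {
    fn new(soft_limit: usize) -> Self {
        MessageQueue {
            soft_limit,
            queue: VecDeque::new(),
            offered: 0,
            dropped: 0,
        }
    }

    /// Returns true if the message was queued, false if it was dropped.
    fn push(&mut self, kind: Kind, id: u64) -> bool {
        self.offered += 1;
        let full = self.queue.len() >= self.soft_limit;
        if full && kind != Kind::DataRelocation {
            self.dropped += 1;
            return false;
        }
        self.queue.push_back((kind, id));
        true
    }

    /// Fraction of offered messages that were dropped: a crude load indicator.
    fn drop_rate(&self) -> f64 {
        if self.offered == 0 {
            0.0
        } else {
            self.dropped as f64 / self.offered as f64
        }
    }
}
```

The trade-off raised in the update shows up directly here: because relocation messages bypass the limit, the queue can grow past `soft_limit` during heavy churn, which protects data at the cost of more memory and traffic.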
I would expect perhaps it needs to be consistent with whatever will be the factor that determines a node’s ability to earn safecoin in future… rewarding the sense of what kind of node you want to see more of.
So, my naive first pass would be each node keeping a simple log of its perception of others, which would be something like the product of the [connection uptime to that node] and [speed of work done by that other node]=[data|messages].[saved|retrieved].[speed of retrieval]. Then the sum of those others’ perceptions of me becomes my indicator.
The bit where that falls down relative to safecoin perhaps is me-as-node summing others’ perceptions - but safecoin reward perhaps will be more complex and done by group consensus… for now, we just need the sum as indicator/feedback. It doesn’t even have to be 20% of something… just a number that we can wonder at. Bigger number an indicator of a problem… if it’s a log of N, then pruning the truly ugly nodes can be a haircut at logN=eek. If nodes are experiencing bad lag, then perhaps they should consider themselves cut off and do whatever they can to hibernate and revalidate as their environment improves. Without thinking it through, I’m expecting that the work done retrieving data would somehow then acknowledge the reality of the disk space actually usefully put to the network, instead of needing to measure that.
You can’t rely on the perception of one node, as its perception will reflect its own performance, but I would expect the sum of all other peers to better reflect the reality. Consensus in perception is not without value.
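A minimal sketch of that naive first pass, purely to show the arithmetic: each observation is uptime multiplied by a work rate, and a node's indicator is the sum of what others recorded about it. The `Observation` struct, its fields and the `indicator` function are invented for the example, the units are arbitrary, and the log scaling mentioned above is left out.

```rust
/// What one node has logged about one peer (hypothetical fields).
struct Observation {
    observer: String, // node that made the measurement
    subject: String,  // node being measured
    uptime_secs: f64, // connection uptime to the subject
    work_rate: f64,   // speed of work done by the subject, e.g. bytes/sec
}

/// Sum of all other nodes' perceptions of `subject`.
/// A node's view of itself is ignored, since it cannot be trusted.
fn indicator(subject: &str, log: &[Observation]) -> f64 {
    log.iter()
        .filter(|o| o.subject == subject && o.observer != subject)
        .map(|o| o.uptime_secs * o.work_rate)
        .sum()
}
```

For example, if node a logs (100 s, 2.0) about node c and node b logs (50 s, 4.0) about it, then `indicator("c", &log)` is 400, whatever c claims about itself.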
If there was a way to draw such network perception of self, I’d expect to see more art than science
Great idea. I’ve been looking into a gui for safe vault myself recently.
My idea is to have a web frontend for the vault. At this stage the log messages are ok for extracting info and displaying data in a webpage, but I think eventually an rpc module in the vault will be required. When I have a clearer idea of what calls would be useful I’ll start a topic with my findings.
How are these defined? Is such a definition set in stone as a static set of requirements (and what are those based on) or is it a function of dynamic variables? X amount of bandwidth * Y amount of disk space * Z amount of processor power = minimum requirement points? And what if a node is high in one area but low in another? Let’s say you needed 100 points to meet the requirement or you get cut from the network. Node A could have say 30 points worth of bandwidth, but only 10 in storage but another 60 worth in processor power, thus making the cut and getting onto the network. Node B could have only 20 points of processor power, another 20 in bandwidth but have 60 in storage, again making it onto the network. While Node C only has 10 points of bandwidth, 15 of processor power and 20 worth of storage and ends up being disconnected. But the big question is how much is X, Y and Z worth, or how much is each point worth in processor power, bandwidth and storage? And how is that valuation determined?
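The three example nodes can be written out as a small sketch. It follows the additive reading of the example (points are summed and compared with 100) rather than the multiplied formula, and it deliberately leaves the real question of how points are valued unanswered. `NodeScore` and `MIN_POINTS` are illustrative names only.

```rust
/// Threshold from the example above: 100 points to stay on the network.
const MIN_POINTS: u32 = 100;

/// Points a node has earned in each resource (hypothetical valuation).
struct NodeScore {
    bandwidth: u32,
    storage: u32,
    cpu: u32,
}

impl NodeScore {
    /// Total points, summed across resources as in the worked example.
    fn total(&self) -> u32 {
        self.bandwidth + self.storage + self.cpu
    }

    /// True if the node reaches the minimum requirement.
    fn meets_requirement(&self) -> bool {
        self.total() >= MIN_POINTS
    }
}
```

Node A (30 bandwidth, 10 storage, 60 cpu) and Node B (20, 60, 20) both total exactly 100 and stay, while Node C (10, 20, 15) totals 45 and is cut. A purely additive rule lets strength in one resource hide a serious weakness in another, which is part of what the question above is getting at.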
Yes an RPC mechanism would definitely be nice, to be able to request statistics and change parameters on the fly. Currently there is no user interaction with the vault at all. Which is nice for most users, but I’m a bit nosy and like to see what is going on inside the box