I apologise for the response here, as I know there is so much pressure on all you wonderful people doing the development work, and I have nothing but praise for the effort and quality of what you do. In the grand scheme of things Autonomi, this is a minor negative response in light of all the positives. I really cannot overstate the admiration I have for the work you all do.
This is mentioned because it is important, and it is potentially data that will never show up in your collected metrics, since you can only gather it by asking the community to give it to you. Also, I do have some decades of experience working with comms, and some of this is obvious to me, fundamental if you will.
“By October 12, things were still looking solid.” Not so solid: the rate of shunning was high, unlike anything we saw in the networks prior to the previous 32/4 network.
Did you not see the very large shunning rate happening during this period, which set the stage for what followed?
The high shun rate experienced throughout the test network up to that point meant the network was not stable enough to endure a 10% loss of nodes. This one event caused the already unstable network, with its reduced connectivity (due to vast amounts of shunning), to collapse until only the more stable (connectivity-wise) nodes remained. The least shunned nodes, perhaps.
Due in no small part to the participants being eager beavers who, with kid gloves, coaxed their nodes into rejoining the network. I doubt that in a public setting the general public would have pulled this recovery off with as much success, and a worldwide network would have ended up more segmented.
The “controlled manner” was the effort of the participants, including those with tens of thousands of nodes under their control. You would not see this scale of effort (50-75% of participants micromanaging nodes back to life) in the real world.
I very much doubt this, because the nodes at that stage were barely at 2GB (most at 1GB), so in fact there was little scope for under-provisioned resources, and the nodes were behaving more like those of the previous, pre-4MB max chunk size networks.
Would this not be due to the large shunning rate that had been occurring throughout the whole test, with nodes scrambling to find new good peers?
I hope you are not just taking the raw CPU % usage data. It is common practice for those running large machines to run some intensive tasks under “nice” (for example “nice -n 19”) to reduce those tasks’ impact on everything else. In other words, I could run my nodes plus an intensive stats task and have CPU at 100% all the time, yet the nodes would suffer no impact from it.
Far better to use internal metrics in the node s/w that measure effort vs time: every major task is given a value for effort, the time taken to perform it feeds into a global value in the node, and from that you can determine the performance of the node. Then all you need to do is determine the lowest acceptable value for that global performance figure, and you have a true measure rather than an artificial measure of CPU load.
As a result of this effort measuring, you also capture the effects of swapping, disk access and other non-CPU factors that affect the performance of the node. A far better measure of overall node performance, all for the cost of timing the major tasks and assigning each a relative “effort” value.
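To make the idea concrete, here is a minimal sketch of what I mean, assuming nothing about the actual node internals; the struct, weights and threshold are placeholders I have invented purely for illustration:

```rust
use std::time::Instant;

/// Hypothetical effort-weighted performance tracker (illustrative only).
/// Each major task is given a relative "effort" weight; the time it actually
/// takes feeds an exponential moving average of seconds-per-unit-effort.
struct PerfTracker {
    /// EMA of elapsed seconds per unit of effort (lower is better).
    secs_per_effort: f64,
    /// Smoothing factor for the EMA.
    alpha: f64,
    /// Worst acceptable value before the node reports itself as struggling.
    threshold: f64,
}

impl PerfTracker {
    fn new(threshold: f64) -> Self {
        Self { secs_per_effort: 0.0, alpha: 0.1, threshold }
    }

    /// Time a task of the given effort weight and fold it into the global score.
    fn record<T>(&mut self, effort: f64, task: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let result = task();
        let per_effort = start.elapsed().as_secs_f64() / effort;
        self.secs_per_effort =
            self.alpha * per_effort + (1.0 - self.alpha) * self.secs_per_effort;
        result
    }

    /// True while the node is keeping up. Any cause of slowness
    /// (CPU, swap, disk, network) shows up in the elapsed time.
    fn healthy(&self) -> bool {
        self.secs_per_effort <= self.threshold
    }
}

fn main() {
    let mut perf = PerfTracker::new(0.5);
    // e.g. storing a record might be weighted 3x a simple query.
    perf.record(3.0, || { /* store record */ });
    perf.record(1.0, || { /* answer query */ });
    println!("node healthy: {}", perf.healthy());
}
```

The point is that any slowdown, whatever its cause, lengthens the measured time, so this one figure reflects the real performance of the node rather than an artificial CPU load number.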
You could also add in the disk space allocation scheme I suggested, to help people without the knowledge avoid over-provisioning nodes on their machine (i.e. under-provisioning disk resource). It is a simple and effective method. Yes, those with knowledge can bypass it, but one would hope they also have the knowledge to close nodes as space usage increases. It is just one small way to ensure the vast majority do not run out of disk space.
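I won’t repeat the details of that earlier suggestion here, but purely as an illustration of the shape of such a guard (the function, figures and reserve fraction below are all hypothetical), a launcher could refuse to start more nodes than the free disk can cover at the worst-case store size per node:

```rust
/// Hypothetical guard: cap the node count by free disk space.
/// `max_store_per_node` is the worst-case bytes a single node could hold,
/// and `reserve` is the fraction of free space deliberately left untouched.
fn max_nodes(free_disk_bytes: u64, max_store_per_node: u64, reserve: f64) -> u64 {
    let usable = (free_disk_bytes as f64 * (1.0 - reserve)) as u64;
    usable / max_store_per_node
}

fn main() {
    // Example figures only: 500 GB free, 35 GB worst case per node, keep 20% spare.
    let free = 500 * 1024 * 1024 * 1024u64;
    let per_node = 35 * 1024 * 1024 * 1024u64;
    println!("safe node count: {}", max_nodes(free, per_node, 0.2));
}
```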
And this will continue to see instability due to all the issues that come with using real-life communication networks. I am serious here: 1/2 MB worked fine, but 4MB pushes the error rate (at least 64 times the problems, for one) too far for stability in the network design. We saw that in the shun rate being very high (nodes being shunned and nodes shunning others), sometimes two or more shuns per hour. This is in line with the comms error rates one expects when pushing such huge data blocks over internet links.
HINT: why do you think packet sizes on the internet remain, for the most part, at 1500 bytes? Better error recovery, fewer delays, better performance. Sending data blocks over the internet also performs better with smaller sizes: more frequent error checking, quicker error recovery, and so on. This is why increasing the data block size increases the problems by more than the square of the size increase, especially with UDP.
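A rough back-of-envelope sketch of why bigger blocks hurt, assuming a nominal 0.1% independent per-packet loss rate (an assumed figure, not a measurement from this network). It does not prove the exact scaling, but it shows how quickly the per-chunk recovery burden grows:

```rust
/// Back-of-envelope: probability that a chunk sees at least one lost packet,
/// assuming independent per-packet loss (the 0.1% figure is an assumption).
fn p_chunk_hit(chunk_bytes: f64, packet_bytes: f64, p_loss: f64) -> f64 {
    let packets = (chunk_bytes / packet_bytes).ceil();
    1.0 - (1.0 - p_loss).powf(packets)
}

fn main() {
    let p = 0.001; // assumed 0.1% per-packet loss
    let half_mb = p_chunk_hit(0.5 * 1e6, 1500.0, p);
    let four_mb = p_chunk_hit(4.0 * 1e6, 1500.0, p);
    // Roughly 28% vs 93%: with bigger blocks nearly every transfer needs some
    // recovery work, and each recovery costs more.
    println!("0.5 MB: {:.0}%  4 MB: {:.0}%", half_mb * 100.0, four_mb * 100.0);
}
```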
You are in the realm of expected issues due to the max record size being 4MB. Have fun keeping it at that.
HINT: this is the cause of some of the resource limits being hit (bandwidth etc).
I predict the network will work with a max record size of 4MB if people are careful and handle it with “kid gloves”, but once the less knowledgeable public start using it, expect the garden hose effect to continue.
Unfortunately this analysis, as good as it is and for all the effort that went into it, has missed some very important data points that the metrics never captured, and it shows a lack of long-term comms experience. The attempt to maximise data flow has fallen into the trap that many programmers have fallen into over the decades when sending data over comms links.
Please, oh please, provide very detailed documentation on how to set this up for us to use. There are a number of us testers who have NEVER used ETH, ERC20 or an L2 and are clueless at this time, and the docs out there range from garbage to OK and are too long to sort through to get up and running in a short time period.