This afternoon, we launched an updated network that requires a reset. Please follow the instructions on how to do that.
This network is a second attempt at giving a good community tyre-kicking to the 32GB/4MB node/chunk combo that has proved most promising in internal testing.
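For anyone who likes a bit of napkin maths, here is what that combination implies in raw numbers. This is simple arithmetic on the figures above, nothing official:

```python
# Back-of-the-envelope for the 32GB-node / 4MB-chunk combination under test.
node_capacity_gb = 32   # maximum storage per node in this test
chunk_size_mb = 4       # maximum chunk size in this test

max_chunks_per_node = node_capacity_gb * 1024 // chunk_size_mb
print(f"A completely full node would hold roughly {max_chunks_per_node:,} chunks")
# -> A completely full node would hold roughly 8,192 chunks
```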
Our first run was scuppered by the sudden failure or withdrawal of a large number of nodes. The ensuing surge of data replication then apparently knocked a lot of other nodes offline. While the network recovered somewhat after that, it was hardly the best test environment. A full post-mortem follows below.
We're also delighted to welcome @Gill_McLaughlin to the team to help us with our marketing and comms. Gill will be on hand to run through our Marketing Plans and Activations at our Discord Stage event on 22nd October.
And @bochaco has created a new Docker-based app, Formicaio, for folks to try out for managing their running nodes on the Beta network. As he says, it's early days yet, but initial reviews have been very positive.
So what happened with Tuesday's Network?
As you will know, this network aimed to validate the best-performing combination of node size and chunk size from our internal lab tests by putting it through its paces in real-world conditions.
We had a target storage utilisation benchmark to hit (how full we wanted the network to be), and with just a week to get there we set a relatively aggressive launch and data propagation schedule.
As you will have seen, it was a bit of a rocky ride, and so we couldn't get the best data or the clear picture we would have liked.
But what did go on? And was the 4MB chunk size to blame?
Well, yes, but maybe no. Here's a blow-by-blow account with the juicy details for those who want a peek behind the scenes and to understand why we want another clean run at it:
Period 1: The Calm Before the Storm (13:45 UTC, Network Size: 1,975)
We started strong, with the network in a steady state. All MaidSafe nodes were performing well:
- Memory usage hovered around 190MB, with 1,010 files stored per node and a record store size of 1.28GB.
- Each node had about 127 connected peers and 150 entries in its routing table. Everything seemed stable.
- On the hosting side, the nodes were using about 15% of CPU capacity, and memory usage was steady at 4.57GB out of 8GB available. Disk I/O was minimal, with less than 1MB/s reads and writes.
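For the curious, figures like these come from polling each node's metrics and averaging across the fleet. Here is a minimal sketch of that idea; the port range, endpoint path and metric names are placeholders made up for illustration, not the node software's actual interface:

```python
# Minimal sketch: scrape metrics from locally running nodes and average a
# couple of values. Ports, path and metric names are illustrative only.
import re
import urllib.request

NODE_METRIC_PORTS = range(13000, 13025)   # hypothetical: one port per local node
METRICS_PATH = "/metrics"                 # hypothetical endpoint path

def scrape(port: int) -> dict[str, float]:
    """Fetch one node's metrics page and parse simple 'name value' lines."""
    url = f"http://127.0.0.1:{port}{METRICS_PATH}"
    with urllib.request.urlopen(url, timeout=2) as resp:
        text = resp.read().decode()
    metrics = {}
    for line in text.splitlines():
        m = re.match(r"^(\w+)\s+([0-9.eE+-]+)$", line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

samples = []
for port in NODE_METRIC_PORTS:
    try:
        samples.append(scrape(port))
    except OSError:
        pass  # node not running or endpoint unreachable; skip it

def avg(name: str) -> float:
    vals = [s[name] for s in samples if name in s]
    return sum(vals) / len(vals) if vals else float("nan")

# Metric names below are made up for illustration.
print(f"nodes sampled:      {len(samples)}")
print(f"avg memory (MB):    {avg('memory_used_mb'):.0f}")
print(f"avg records stored: {avg('records_stored'):.0f}")
```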
It was all smooth sailing… until the network started to grow.
Period 2: Rapid Expansion and Strain (13:45 - 14:30 UTC, Network Size: 28,000)
By 13:45 UTC, the network had begun to scale quickly, with the number of nodes exploding to 28,000 in a very short period. This rapid expansion led to immediate changes:
- Memory usage per node increased to 270MB, and the number of files in the record store climbed slightly to 1,019.
- The number of connected peers jumped to 325, while the routing table expanded to 233 peers.
- CPU usage spiked dramatically to 99%, and memory usage on our hosts ballooned to 7.65GB out of 8GB. Disk I/O also surged to 1GB/s as the nodes struggled to keep up with the load.
As the network scaled, cracks started to show. We noticed an increase in the number of 'shunned' peers (those marked as unreliable), and the 'bad peer' count started to rise. It was clear that the network was feeling the strain.
Period 3: Network Overload (14:30 - 15:40 UTC, Network Size: 73,600)
Things really hit a breaking point when the network size skyrocketed to 73,600. By this time, we began to see nodes falter under the pressure:
- Memory usage per node dipped slightly to 225MB, but the number of chunks continued to grow, reaching 1,030.
- The number of connected peers dropped to 286. Open connections also fell to 268, and CPU usage dropped to 43%.
- Disk I/O, which had peaked earlier, decreased.
Period 4: Sudden Contraction (15:50 UTC, Network Size: 73,600 to ~10,000)
Just as the network seemed to be reaching a peak, there was a sudden and drastic contraction. At 15:50 UTC, the network rapidly shrank from 73,600 nodes back down to around 10,000 in a matter of minutes.
We're still investigating the exact cause of this sudden collapse, but we have some hypotheses. The rapid contraction hints at the possibility of a power user or a small number of larger operators being overwhelmed.
One theory is that a power user, responsible for a significant portion of the initial rapid node increase, may have been overwhelmed by the replication load. With each node starting out with around 700 relevant records, and the surge from 2K to 70K nodes triggering heavy re-replication, it's possible that their nodes' bandwidth and resources were exhausted by the sheer volume of data being replicated. This might have caused them to pull the plug, resulting in the sharp drop in network size.
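To put some rough numbers on that replication load, here is a hedged back-of-the-envelope. The replication factor, and the assumption that every record gets re-sent during churn, are ours and are for illustration only:

```python
# Rough, illustrative estimate of the replication burst one loaded node
# could face when the network around it churns heavily.
relevant_records = 700     # records per node at launch (figure from this post)
chunk_size_mb = 4          # maximum chunk size under test
replication_factor = 5     # assumed number of copies pushed out; illustrative only

data_held_gb = relevant_records * chunk_size_mb / 1024
worst_case_resend_gb = data_held_gb * replication_factor

print(f"data held per node:        ~{data_held_gb:.1f} GB")
print(f"worst-case re-replication: ~{worst_case_resend_gb:.1f} GB pushed out")
# -> roughly 2.7 GB held, and well over 10 GB to push out if every record
#    has to be offered to several new neighbours at once
```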
What We Observed: Lessons Learned
Looking back, it's clear that a few factors contributed to the network's struggles:
- Resource Strain on MaidSafe Nodes: The memory and CPU resources allocated to our nodes were simply not enough to handle the rapid growth. About half of our initial seeded nodes crashed between Periods 2 and 3, creating a ripple effect across the network.
- Heavy Initial Load: Having 700+ relevant records to replicate right out of the gate put an unnecessary strain on both our nodes and the community nodes trying to join the network.
- Unexplained Network Spike: The sudden increase to 70,000+ nodes in just 1.5 hours was unexpected. We're still unsure if this was the entire community joining at once, or if a few power users caused the spike. Interestingly, 24 hours later, the network size increased again by a similar amount right before we decided to pull the plug. This behaviour supports the idea that a few large users (or even automated processes) might have been responsible for these rapid changes in network size.
The rapid growth also led to higher memory usage as nodes struggled to maintain connections. This memory overload caused process terminations on many nodes, and the network started to collapse soon after.
What We're Changing for the Next Test
We've learned a lot from this test and are already implementing several key changes for the next round:
- Increased Resources: We're quadrupling the CPU and memory allocated to our droplets, while still limiting each to 25 nodes. We've also added 120GB of disk space per droplet for paging (swapfile), just in case nodes use more memory than expected.
- Smarter Data Replication: We won't overload the network with 700+ relevant records from the start. Instead, we'll allow the network to fill gradually, giving nodes more time to handle the load. Smoother and steadier, especially while the network is young… even if that comes at the expense of the network being less full towards the end of the test.
- Uploader Capacity: We're doubling our uploader capacity, allowing more uploader services to run on each droplet, and we'll scale these uploaders more cautiously, checking system performance every six hours (see the sketch after this list).
- Stable Bootstrap Nodes: Our bootstrap nodes (which each host one node) performed well and didn't hit resource limits, so we'll keep that configuration unchanged.
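As a rough illustration of the "scale cautiously, check every six hours" idea from the uploader point above, here is one way such a gating loop could look. The thresholds and the host_stats/start_uploader hooks are made-up placeholders, not the project's actual tooling:

```python
# Sketch of a cautious uploader ramp: add one more uploader only if the host
# still looks healthy at each six-hour checkpoint. All thresholds and the
# two hook functions below are illustrative placeholders.
import time

CHECK_INTERVAL_S = 6 * 60 * 60   # six hours between scaling decisions
CPU_LIMIT_PCT = 70               # assumed "healthy" ceilings, not real limits
MEM_LIMIT_PCT = 75
MAX_UPLOADERS = 8                # assumed cap per droplet

def host_stats() -> tuple[float, float]:
    """Return (cpu_pct, mem_pct). Placeholder: wire this up to real monitoring."""
    return 50.0, 60.0  # dummy values so the sketch runs

def start_uploader(index: int) -> None:
    """Placeholder for launching uploader service number `index`."""
    print(f"starting uploader {index}")

running = 1
start_uploader(running)
while running < MAX_UPLOADERS:
    time.sleep(CHECK_INTERVAL_S)
    cpu, mem = host_stats()
    if cpu < CPU_LIMIT_PCT and mem < MEM_LIMIT_PCT:
        running += 1
        start_uploader(running)   # host looks fine, scale up one step
    else:
        print(f"holding at {running} uploaders (cpu={cpu}%, mem={mem}%)")
```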
We're confident that these changes will help ensure a smoother launch next time. It also helps us build a battle-hardened approach to the launch sequence come go-time.
We'll continue to monitor the network closely and make adjustments as needed, but we're looking forward to a more successful test run with your help!
General progress
@rusty.spork has been relaying tales of woe from the trenches, with community CPUs maxing out and nodes falling over, which is what initially alerted us to the problem.
@shu's metrics showed when the problems started. @chriso and @joshuef also helped dissect the issues and worked on some ways to solve them. Josh also got range-based searching (Sybil resistance) functionality into the EVM-based network code.
@anselme worked on API docs for JS to connect the autonomi API with the web front end, and wrapped up the register integration in the CLI, although there are still some issues with register GET from the network.
@bzee continues to chip away at Wasm so that we can use that to connect to the autonomi API too. It's a slog, but he's making progress there. Cost, putting data, and getting data all work now.
Meanwhile, Ermine has been working with Metamask's transaction APIs that deal with smart contracts and pass metadata for transaction verification.
And @Mazzi has reset the Discord bot for the new network, as well as making UI improvements to the launchpad.
@qi_ma put in a security fix to prevent clients accidentally asking bad (shunned) nodes for a quote, made some improvements to the auditor, and worked on simplifying the quoting process.
And @roland has been mostly on continuous integration tests for the EVM-based network.