Update 10th October, 2024

This afternoon we launched an updated network that requires a reset. Please follow the instructions on how to do that.

This network is a second attempt at giving a good community tyre kicking to the 32GB/4MB node/chunk combo that has proved most promising in internal testing.

Our first run was scuppered by the sudden failure or withdrawal of a large number of nodes. The ensuing surge of data replication then apparently knocked a lot of other nodes offline. While the network recovered somewhat after that, it was hardly the best test environment. A full post-mortem follows below.

We're also delighted to welcome @Gill_McLaughlin to the team to help us with our marketing and comms. Gill will be on hand to run through our Marketing Plans and Activations at our Discord Stage event on 22nd October.

And @bochaco has created Formicaio, a new docker-based app for folks to try out for managing nodes running on the Beta network. As he says, it's early days as yet, but initial reviews have been very positive.

So what happened with Tuesday's Network?

As you will know, this network aimed to validate the best-performing combination of node and chunk size from our internal lab tests by putting it through its paces in real-world conditions.

We had a target storage utilisation benchmark to hit (how full we wanted the network to be) and with just a week to get there we had a relatively aggressive launch and data propagation schedule.

As you will have seen, it was a bit of a rocky ride, so we couldn't get the best data or the clear picture we would have liked.

But what did go on? And was the 4MB chunk size to blame?

Well yes, but maybe no. Here's a blow-by-blow account with the juicy detail for those who want a peek behind the scenes and to understand why we want another clean run at it:

Period 1: The Calm Before the Storm (until 13:45 UTC, Network Size: 1,975)

We started strong, with the network in a steady state. All MaidSafe nodes were performing well:

  • Memory usage hovered around 190MB, with 1,010 files stored per node and a record store size of 1.28GB.
  • Each node had about 127 connected peers and 150 entries in their routing tables. Everything seemed stable.
  • On the hosting side, the nodes were using about 15% of CPU capacity, and memory usage was steady at 4.57GB out of 8GB available. Disk I/O was minimal, with less than 1MB/s reads and writes.
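For context, here is a quick illustrative reading of those numbers, assuming the 32GB node size under test; it's a sketch, not our monitoring tooling:

```rust
// Rough reading of the Period 1 figures: average record size and how full
// a 32 GB node was at that point. Illustrative arithmetic only; the inputs
// are the numbers reported above.
fn main() {
    let files_per_node = 1_010u64; // files stored per node
    let store_gb = 1.28_f64; // reported record store size
    let node_capacity_gb = 32.0_f64; // node size under test

    let avg_record_mb = store_gb * 1024.0 / files_per_node as f64;
    let fill_pct = store_gb / node_capacity_gb * 100.0;

    println!("Average record size: ~{:.2} MB", avg_record_mb); // ~1.3 MB
    println!("Node fill level: ~{:.1}% of 32 GB", fill_pct); // ~4%
}
```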

It was all smooth sailing… until the network started to grow.

Period 2: Rapid Expansion and Strain (13:45 - 14:30 UTC, Network Size: 28,000)

From 13:45 UTC, the network began to scale quickly, with the number of nodes exploding to 28,000 in a very short period. This rapid expansion led to immediate changes:

  • Memory usage per node increased to 270MB, and the number of files in the record store climbed slightly to 1,019.
  • The number of connected peers jumped to 325, while the routing table expanded to 233 peers.
  • CPU usage spiked dramatically to 99%, and memory usage on our hosts ballooned to 7.65GB out of 8GB. Disk I/O also surged to 1GB/s as the nodes struggled to keep up with the load.

As the network scaled, cracks started to show. We noticed an increase in the number of "shunned" peers (those marked as unreliable), and the "bad peer" count started to rise. It was clear that the network was feeling the strain.

Period 3: Network Overload (14:30 - 15:40 UTC, Network Size: 73,600)

Things really hit a breaking point when the network size skyrocketed to 73,600. By this time, we began to see nodes falter under the pressure:

  • Memory usage per node dipped slightly to 225MB, but the number of chunks continued to grow, reaching 1,030.
  • The number of connected peers dropped to 286. Open connections also fell to 268, and CPU usage dropped to 43%.
  • Disk I/O, which had peaked earlier, decreased.

Period 4: Sudden Contraction (15:50 UTC, Network Size: 73,600 to ~10,000)

Just as the network seemed to be reaching a peak, there was a sudden and drastic contraction. At 15:50 UTC, the network rapidly shrank from 73,600 nodes back down to around 10,000 in a matter of minutes.

We're still investigating the exact cause of this sudden collapse, but we have some hypotheses. The rapid contraction hints at the possibility of a power user or a small number of larger operators being overwhelmed.

One theory is that a power user, responsible for a significant portion of the initial rapid node increase, may have been overwhelmed by the replication load. With the number of relevant records starting from 700 to accommodate the surge from 2K to 70K nodes, it's possible that their nodes' bandwidth and resources were exhausted by the sheer volume of data being replicated. This might have caused them to pull the plug, resulting in the sharp drop in network size.
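To make that concrete, here is a back-of-envelope sketch, assuming each of the ~700 relevant records is close to the 4MB chunk cap and a 20Mb/s home uplink; both are assumptions for illustration, not measurements from our telemetry:

```rust
// Back-of-envelope sketch of the replication load on one home node.
// Assumptions (not measurements): every relevant record is near the 4 MB
// chunk cap, and the uplink is a typical 20 Mb/s home connection.
fn main() {
    let relevant_records: u64 = 700;
    let max_chunk_bytes: u64 = 4 * 1024 * 1024; // 4 MB chunk cap under test
    let per_peer_bytes = relevant_records * max_chunk_bytes;
    println!(
        "Worst-case replication payload per receiving peer: ~{:.1} GB",
        per_peer_bytes as f64 / 1e9
    );

    // ~2.9 GB at 20 Mb/s is roughly 20 minutes of saturated upload,
    // and that is per peer that needs a copy of the data.
    let uplink_bits_per_sec = 20_000_000.0_f64;
    let seconds = per_peer_bytes as f64 * 8.0 / uplink_bits_per_sec;
    println!("At 20 Mb/s that is ~{:.0} s of saturated upload", seconds);
}
```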

What We Observed: Lessons Learned

Looking back, it's clear that a few factors contributed to the network's struggles:

  1. Resource Strain on MaidSafe Nodes: The memory and CPU resources allocated to our nodes were simply not enough to handle the rapid growth. About half of our initial seeded nodes crashed between Periods 2 and 3, creating a ripple effect across the network.
  2. Heavy Initial Load: Having 700+ relevant records to replicate right out of the gate put an unnecessary strain on both our nodes and the community nodes trying to join the network.
  3. Unexplained Network Spike: The sudden increase to 70,000+ nodes in just 1.5 hours was unexpected. We're still unsure if this was the entire community joining at once, or if a few power users caused the spike. Interestingly, 24 hours later, the network size increased again by a similar amount right before we decided to pull the plug. This behaviour supports the idea that a few large users, or even automated processes, might have been responsible for these rapid changes in network size.

The rapid growth also led to higher memory usage as nodes struggled to maintain connections. This memory overload caused process terminations on many nodes, and the network started to collapse soon after.

What We're Changing for the Next Test

We've learned a lot from this test and are already implementing several key changes for the next round:

  • Increased Resources: We're quadrupling the CPU and memory allocated to our droplets, while still limiting each to 25 nodes. We've also added 120GB of disk space per droplet for paging (swapfile), just in case nodes use more memory than expected. (A rough sizing sketch follows this list.)
  • Smarter Data Replication: We won't overload the network with 700+ relevant records from the start. Instead, we'll allow the network to fill gradually, giving nodes more time to handle the load. Smoother, steadier, especially while the network is young… even if that comes at the expense of the network being less full towards the end of the test.
  • Uploader Capacity: We're doubling our uploader capacity, allowing more uploader services to run on each droplet, and we'll scale these uploaders more cautiously, checking system performance every six hours.
  • Stable Bootstrap Nodes: Our bootstrap nodes (which each host one node) performed well and didn't hit resource limits, so we'll keep that configuration unchanged.
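For a sense of scale on the resources bullet, here is a rough, illustrative sizing sketch using only numbers quoted in this update (25 nodes per droplet, the 32GB node size, and the ~270MB peak memory per node seen in Period 2); it is not our actual provisioning tooling:

```rust
// Rough per-droplet sizing from figures in this update. Illustrative only.
fn main() {
    let nodes_per_droplet: u64 = 25;
    let max_store_gb: u64 = 32; // 32 GB node size under test
    let peak_node_mem_mb: f64 = 270.0; // Period 2 observation

    let worst_case_disk_gb = nodes_per_droplet * max_store_gb;
    let peak_mem_gb = nodes_per_droplet as f64 * peak_node_mem_mb / 1024.0;

    println!("Worst-case record storage per droplet: {} GB", worst_case_disk_gb); // 800 GB
    println!("Peak node memory per droplet: ~{:.1} GB before the host OS", peak_mem_gb); // ~6.6 GB
    // ~6.6 GB of node memory alone is close to the 8 GB the droplets had,
    // which is why quadrupling memory and adding a 120 GB swapfile gives headroom.
}
```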

We're confident that these changes will help ensure a smoother launch next time, and they also help us build a battle-hardened approach to the launch sequence come go-time.

We'll continue to monitor the network closely and make adjustments as needed, but we're looking forward to a more successful test run with your help!

General progress

@rusty.spork has been relaying tales of woe from the trenches, with community CPUs maxing out and nodes falling over, which is what initially alerted us to the problem.

@shu's metrics showed when the problems started. @chriso and @joshuef also helped dissect the issues, and worked on some ways to solve them. Josh also got range-based searching (Sybil resistance) functionality into the EVM-based network code.

@anselme worked on API docs for JS to connect the autonomi API with the web front end, and wrapped up the register integration in the CLI, although there are still some issues with register GET from the network.

@bzee continues to chip away at Wasm so that we can use that to connect to the autonomi API too. It's a slog, but he's making progress there. Getting the cost, putting data and getting data all work now.

Meanwhile, Ermine has been working with Metamask's transaction APIs that deal with smart contracts and pass metadata for transaction verification.

And @Mazzi has reset the Discord bot for the new network, as well as making UI improvements to the launchpad.

@qi_ma put in a security fix to prevent clients accidentally asking bad (shunned) nodes for a quote, made some improvements to the auditor, and worked on simplifying the quoting process.

And @roland has been mostly on continuous integration tests for the EVM-based network.

41 Likes

Are we adapting the testing pressure now to the stress the network can take in the current configuration…?

Shouldn't we consider that maybe the latest changes did destabilise it? We didn't see those fatal results in networks with smaller node sizes and especially smaller chunk sizes…?

Will the outside world behave the same when it's no longer only MaidSafe in control of the upload rate…? (An undersea cable breaks… a hurricane or war causes largish power outages.)

15 Likes

Thanks so much to the entire Autonomi team for all of your hard work! :sweat_drops:

And good luck to everyone with your Beta Rewards! :four_leaf_clover:

9 Likes

A decentralized network suffering from centralization - an interesting thing to think about.
Maybe allowing a single person to operate a large number of nodes is not such a good idea.

What were they doing, if not storing more data?
9 additional files should not create such a large load (theoretically).

9 Likes

You didn't bring this up, and I don't know enough about your data collection process to tell if it's redundant, but there were a few conversations in the Discord that suggest that at least some node operators intentionally set their systems to run more nodes than they actually had disk capacity to support.

One hypothesis might be that a bunch of nodes collapsed when they ran out of real disk space.

14 Likes

I'll say this is highly unlikely… A machine running out of space at that early stage of the game would have had to be overcommitted by around 100x or so… We were just starting the game and all my machines were super empty…

14 Likes

Thx 4 the update Maidsafe devs

welcome @Gill_McLaughlin

Just updated

:clap: :clap: :clap:

keep coding/hacking/testing supants :beers: :beer:

15 Likes

Not sure what I'd do without the weekly updates. Keep 'em coming.

9 Likes

This is a storage network, so there should be some installation advice for the safenode operator: install your safenodes with the recommended amount of free, unassigned space left open, in order to give the OS of choice, and the SSD (NVMe or SATA) drives themselves, the space and time needed to handle their 'out of band' defrag, garbage collection and wear-levelling operations efficiently.

Every time a change is made to a file on a flash drive, the entire contents are read into memory, the changes are applied (i.e. adding to the file, deleting the file, renaming the file), and then the file is re-written in a linear fashion by the OS file system, as far as possible, to keep both writes and reads fast.

N.B. In the case of media wear levelling (the conductive media coating wears out, so data is moved to a new place on the media with a good conductive charge before that spot loses the ability to hold a charge and represent 0s and 1s as + and - charges), the FTL (Flash Translation Layer) keeps track of what condition the different media locations are in. In most cases this runs in the limited memory of the SATA flash controller or the NVMe drive's on-board logic. For example, the FTL logic periodically checks the drive's SMART data, measuring voltage levels against the accepted +/- tolerances, and if a reading gets close to the minimum tolerance, the FTL main loop runs a subroutine to move the data (read/copy it, find a good location, write it to that new location) and then records the bad location in the FTL table, so there are no further attempts to write to it.

The SSD vendor-recommended amount of storage space to leave free on any flash drive, to keep your SSD functioning properly (good write and read speeds) in the above regard and to support maximum concurrent peak loads, is 30% free or unassigned capacity (NVMe or SATA SSD).

Any node operator who attempted to run their safenodes with less than that, where the used capacity rose beyond 70% and ate into the 30% unassigned reserve needed for those background operations, will have run into serious safenode write and read performance degradation (slower responses in all functions).

These reductions in read/write performance are largely due to your system spending more time on background, out-of-band work (copies, reads, moves and writes to keep the data linearised, all blocks in a row per process/session/app/safenode) instead of on the primary things that earn you nanos (connecting, relaying information about your and other safenodes' state to your close group and other groups, storing/writing chunks, copying chunks in memory, logging activity, etc.).

Having too much 'out of band' activity going on, because your unassigned space has shrunk below the SSD vendor-recommended 30% left free (i.e. above 70% used), will most certainly have contributed to your safenodes being SLOW to relay, update routing tables, make connections, and store and read files locally, causing such nodes to be shunned really quickly.

Which really means, in short: not enough unassigned space = no nanos.

I hope this helps. :wink:
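To put rough numbers on it, here is a minimal sketch of that arithmetic, assuming the 32GB node size from this test and the 30% free-space rule of thumb above; the drive sizes are just examples:

```rust
// Illustrative only: how many full 32 GB safenodes fit on a drive if you
// honour a 30% free-space reserve. The 30% figure is the over-provisioning
// rule of thumb described above, not an official MaidSafe recommendation.
fn max_nodes(drive_gb: u64, node_store_gb: u64, reserve_fraction: f64) -> u64 {
    let usable_gb = (drive_gb as f64 * (1.0 - reserve_fraction)) as u64;
    usable_gb / node_store_gb
}

fn main() {
    for drive_gb in [500u64, 1000, 2000] {
        println!(
            "{} GB drive, 30% kept free -> room for {} full 32 GB nodes",
            drive_gb,
            max_nodes(drive_gb, 32, 0.30) // 10, 21 and 43 respectively
        );
    }
}
```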

9 Likes

Welcome @Gill_McLaughlin from the community and myself, we are glad to see you join the MaidSafe team and hope your stay is both fruitful and enjoyable. Seems you have your job cut out for you.

14 Likes

No one asked me but I'll drop my 2 cents. I may be critical at times, but I completely understand the stress everyone is going through and I am not judging anyone or anything like that. You are great devs and I have the utmost respect for that. If you want, I am that grumpy old man who thinks he knows it all, if that helps :wink:

If you do not know control theory, I'll try to use examples of other real-world things that control system theory can explain, because they are a sort of control system: you have an action, and when you do something the system reacts to it. The reaction can be modelled with control system theory since essentially they are one and the same.

The network was perturbed by some actions, and one of the parameters of the network control system is the chunk size. Why is this important?

  • It defines the upload rate of chunks from the node to the network: how long just one chunk takes to be sent.
  • Digital Ocean droplets use highly optimised networking, with 10Gb/s internal connections between servers (local and across the pond) and TB connections between data centres.
  • Home computers range from 10Mb/s to 5Gb/s, connecting through ISP infrastructures which end up nowhere near as good as DO's. The vast majority of home computers were on 20 to 100Mb/s connections.
  • Some were using VPSes similar to DO, but traffic still had to travel through the backbone of the internet to reach DO servers and home nodes.
  • The delay in uploading from home computers, even on 1Gb/s, means the "delay of uploading" parameter was greatly extended compared with internal testing.

All this means that in internal DO testing the delay parameter was quite low and did not show any issues, even with many chunks being uploaded at the same time. Increasing it, because of many nodes joining and slower uplinks, pushed the parameter past a critical point, producing an effect similar to fish-tailing, or how a potter's wheel will see an out-of-balance pot fling about, or a garden hose flinging about when the pressure increases.

One reason we did not see this effect in the networks with 1/2MB (and, way back, 1MB) chunk sizes is that the upload was spread across many more nodes, so rather than 1/8th of the nodes shouldering the churning load there was a much larger number, and the delays caused were hugely smaller. The reason increasing it beyond a certain point is a problem is that it brings other imbalances into play: packet losses increase, bandwidth limits are hit, which dramatically increases the delay factor, and so on. Multiple chunks being uploaded at once slow each other down, and with large chunks that can push them past the timeout of the receiving node (hitting limits).
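To put some illustrative numbers on that delay parameter (ignoring protocol overhead and concurrency; the link speeds are just the examples quoted above):

```rust
// Illustrative only: wall-clock time to push a single chunk over various
// uplink speeds, ignoring protocol overhead and contention. It shows why
// a 4 MB chunk loads a 20 Mb/s home uplink far more than a 0.5 MB one.
fn seconds_per_chunk(chunk_mb: f64, uplink_mbps: f64) -> f64 {
    chunk_mb * 8.0 / uplink_mbps // megabits divided by megabits per second
}

fn main() {
    for &chunk_mb in &[0.5, 2.0, 4.0] {
        for &uplink_mbps in &[20.0, 100.0, 1000.0] {
            println!(
                "{:>4} MB chunk over {:>6} Mb/s uplink: {:>5.2} s",
                chunk_mb,
                uplink_mbps,
                seconds_per_chunk(chunk_mb, uplink_mbps)
            );
        }
    }
    // e.g. 4 MB at 20 Mb/s is 1.6 s per chunk before any contention;
    // queue hundreds of chunks for several peers and delays grow fast.
}
```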

Basically, in my opinion and from my knowledge of control systems, the effects of positive feedback, the lack of negative feedback, and/or parameters set too high, the 4MB max chunk size is not yet suitable for general home internet connections, whether in the EU with Gb/s, in poorer parts with 10Mb/s (e.g. Germany or rural anywhere), or on the average 40-50Mb/s. On DO droplets, yes, it's great and probably runs as smooth as butter, and you could probably have a 16MB chunk size run just as smoothly.

I showed how elite Gb/s home internet really covers only a tiny fraction of the world, based on the average upload speeds from testing sites. The relatively few with Gb/s skew the figures hugely, and those averages are not medians (half higher, half lower) but rather the average overall speed available across internet connections collectively.

Let's test a 2MB max chunk size and see how the network control system reacts.

(My opinion.) Nodes going bad are a symptom of the chunk size parameter being too high, and NOT of "user error". Having 20K nodes drop offline all at once will happen often in a live network because of power outages, natural events, cable cuts, etc., and if this causes a cascading effect of other nodes dropping out and people's internet bandwidth maxing out, then the network cannot be considered stable, even if it "recovers". (Note: 20K may be small for a million-node network, but then such outages will affect a lot more nodes, and that will remain the case until the network is adopted; yet people will not adopt while it's unstable :wink: )

Even in this new network, each of my 120 nodes is reporting 30-50 other nodes as being bad. Yes, my nodes are not being shunned as much as previously (only 0-3 shuns each), but the bad-node counts are still a sign that, even with no uploads, the network is not healthy, since my nodes see so many others as bad.

Keep up the great work, much appreciated and please consider testing 1MB and 2MB chunk sizes. Greatly needed.

Please consider this analysis and don't invoke "oh, the users made mistakes and this caused issues", as the initial part of the update was suggesting. That is a symptom of deciding a certain design/configuration is great and running too far with it. <- My opinion, based on 50+ years of programming, some of that in project management.

EDIT: an update on the 120 nodes in the current network. They are now being shunned up to 25 times each, which means that, because of the rate of shunning, nodes will in effect end up segmented from parts of the network. If this happens too much at scale, the potential is for nodes that are more local to each other (distance/network-infrastructure-wise) to be fine with each other, while those with greater lag are not so fine, i.e. the island effect, where you see great connectivity within islands and only some links between islands. This might take days/weeks to happen, but it is a serious possibility.

22 Likes

Excellent postmortem, thank you.
I have been keen to see this play out as joining established networks that hold data has always required far more finesse.

The nature of the game now is that folk are lined up waiting for the trigger on a 100m sprint. It is not just a matter of joining a network but a race to win.

14 Likes

I haven't taken part in this testnet for a while. I didn't want to leave my desktop PC running all day and night, but I just dug out an old Microsoft Surface Go. It's tiny, but at least I am now back in the game.
Launchpad appears to have really improved since my last participation. Well done.
It will only manage 2 nodes, but it is what it is.

7 Likes

Thanks @neo
Excited to be here!

12 Likes

Yes, a good postmortem, utterly gripping. It reminds me of the timeline of events that led to the explosion of reactor No. 4 at Chernobyl. :laughing:

7 Likes

Funny, I did see graphite on the ground shortly after 1600 hours, but thought nothing of it.

10 Likes

Welcome @Gill_McLaughlin! Exciting time to get involved.

I wasn't sure where to post this. I don't want to @ everyone, but at the same time I really think it's worth looking at - I warmly and hopefully encourage any @maidsafe team members who have any say over naming things to have a browse through this page:

A more inspiring and interesting collection of alternative naming schemes for many of the tech metaphors we take for granted I have not found. The page is meant as an overview of projects the "malleable systems collective" find interesting architecturally, but it really is a treasure trove of interesting names and concepts! Please someone consider having a browse.

[Disclaimer: I'm biased towards any language which helps break down the "developer -> user" / "creator -> consumer" / "patrician -> pleb" fault lines.]

6 Likes

Compared to when the network was launched on Thursday, the errors seem to have settled down. I have 80 nodes running, and while there were hundreds of errors per second yesterday, now there are only a few dozen per second.

6 Likes