Update 17th October, 2024

Let’s start with a non-reveal of one of the worst kept secrets in history (thanks to you pesky GitHub sifters). The L2 that we are using as the basis for our network token is … wait for it … Arbitrum.

Why Arbitrum? Well, the first thing to say is that the choice is not set in stone; it can be changed. But it's the one we have decided to go with because it has a solid reputation, great adoption and some of the lowest transaction fees. It also has the highest market share of any L2, and funds can be sent directly from the likes of Robinhood or Coinbase with no cross-chain bridge required, which is important for folks in the US and elsewhere with restrictive regulations, plus it's supported by all respectable CEXs around the world. All in all, it ticks more boxes than any of the alternatives.

Another badly kept secret is that there will be no testnet today, as had been planned. Unfortunately it proved impossible to get stability improvements in place, plus the EVM integration, plus launchpad changes, plus necessary testing in the short time available, so rather than rush something out which would quite likely fall on its face in short order, we’ve decided to delay until Tuesday. Sorry to disappoint. But… we will also be sharing a lot more information with you on that day, so please be sure to mark your calendars on Tuesday with ‘Big Day’.

As you know, the last testnet took a nosedive after a mass shunning event. Once again it recovered, but this is obviously not ideal behaviour and we need to sort it out. Here's what happened.

Autonomi Network Report: Another Test, New Insights

As we entered the latest round of testing for the Autonomi Network, things started off with a steady and promising growth trajectory. However, the network then encountered a sudden and dramatic series of events. Let's walk through what unfolded, what we've learned from this latest test, and why, ultimately, it will lead to a more resilient system and a better launch.

Period 1: Steady Growth and Rising Activity (2024-10-10 to 2024-10-12)

The test began smoothly on October 10, with the network size growing steadily throughout the day. By the evening we were seeing consistent growth, starting from just under 29,000 nodes and peaking at over 100,000 by the night of October 11. During this period, the average number of connected peers per node ranged between 700 and 900, with chunk upload rates progressively increasing as data steadily flowed through the system. By October 12, things were still looking solid. The network size held up around 112,000 nodes, and we were able to hold upload rates at a robust 5,800 chunks per hour. But as we would soon learn, this period of steady growth masked an underlying strain that would eventually lead to a rapid decline.

Period 2: The Network Begins to Strain (2024-10-12, 15:00 - 16:30 UTC)

At 15:00 UTC on October 12, the network still appeared healthy. Uploaders had been pushing at a high but controlled rate, with no further increase. However, subtle signs started to emerge that all was not well. Although over 1,400 chunks were still being uploaded every 15 minutes, the network size, which had been holding at around 111,000 nodes, began a slow but noticeable decline. By 16:00 UTC we saw the first clear indication that something was off: the chunk upload rate was still high, but the number of connected peers per node was declining. This suggested that some nodes were struggling to maintain connections, and resources were starting to stretch thin.

Period 3: The Free Fall (2024-10-12, 16:30 - 17:15 UTC)

The situation deteriorated between 16:30 and 17:00 UTC. The network size dropped from 107,000 nodes to just 24,513 in less than 30 minutes. This was a sharp and sudden contraction, and the average number of connected peers per node skyrocketed—hitting a peak of 2,375 by 17:00 UTC. This indicated that many nodes were struggling to manage too many connections at once. It became clear that the network had been overwhelmed by the rapid growth and the load imposed on individual nodes, leading to widespread failures and a collapse.

Period 4: Recovery and Rebounding (2024-10-13)

After the crash, the network began a slow recovery. By the early hours of October 13, the network size stood at around 9,500 nodes, and chunk uploads had resumed, though at a far slower pace than before. Throughout the day the network steadily rebounded, reaching over 65,000 nodes by 12:15 UTC, though it never fully recovered to its pre-crash peak. Interestingly, the rebound after the collapse followed a highly linear growth pattern, almost as if the network were rebuilding itself in a controlled manner. However, the damage from the initial crash was significant, and while we saw the network regaining size, the pace and stability were far from the levels we had seen earlier.

What Caused the Crash?

Looking at the data, we believe the crash may have been caused by a group of users who provisioned a large number of nodes with insufficient resources to handle the load, particularly CPU. These nodes may have struggled to keep up as the network size grew and the chunk upload rate surged, causing a domino effect that led to widespread node failures.

At the same time, the network's reliance on a growing number of connected peers per node put additional strain on the system. At its peak, some nodes had over 2,400 connected peers, far exceeding what was manageable with their available CPU and memory resources.

Given the data collected from our nodes and the corroborating logs from the community, we can neither confirm nor rule out the 4MB chunk size as the leading factor in the crash. It is, however, clear that the rapid increase in resource consumption, particularly CPU and memory, was a major contributor to the network's failure.

Next Steps: Improvements and Precautions

From this test, we’ve identified several key improvements that we’ll be implementing in future:

1. Managing Resource Limits:
We're ensuring that nodes shut down gracefully if their CPU usage exceeds key limits for prolonged periods (a rough sketch of the idea follows this list). This will prevent nodes from becoming overloaded and misbehaving, reducing the risk of network-wide issues.
2. Closing Bad Connections:
We’ll be closing connections from “shunned” nodes more aggressively to reduce the number of open connections and prevent resource drain.
3. Educating Node Operators:
We'll work to better educate the community about not running more nodes than their hardware can support. Ensuring that nodes have the appropriate resources (CPU, memory, disk) is crucial to network stability.
4. Chunk Size Flexibility:
We’ll continue to allow the network to adjust chunk sizes, up to 4MB, when uploading to ensure flexibility in managing load and avoiding resource exhaustion.
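
To make item 1 a little more concrete, here is a minimal sketch of the kind of watchdog we have in mind. It is illustrative only, not the code going into the node: the helper functions are hypothetical stand-ins and the numbers are placeholders rather than settled values.

```rust
// Illustrative sketch only, not the actual node code. A watchdog loop samples
// CPU usage and asks the node to shut down gracefully if usage stays above a
// limit for a sustained period. `read_cpu_usage()` and
// `trigger_graceful_shutdown()` are hypothetical stand-ins.

use std::{thread, time::Duration};

const SAMPLE_INTERVAL: Duration = Duration::from_secs(10);
const CPU_LIMIT_PERCENT: f32 = 85.0; // placeholder limit
const SUSTAINED_SAMPLES: u32 = 18;   // roughly three minutes over the limit

fn read_cpu_usage() -> f32 {
    // Placeholder: a real node would read this from the OS (e.g. /proc/stat).
    0.0
}

fn trigger_graceful_shutdown() {
    // Placeholder: flush state, close connections cleanly, then exit.
    println!("CPU limit exceeded for too long; shutting down gracefully");
}

fn main() {
    let mut samples_over_limit: u32 = 0;
    loop {
        if read_cpu_usage() > CPU_LIMIT_PERCENT {
            samples_over_limit += 1;
        } else {
            samples_over_limit = 0; // any sample back under the limit resets the count
        }
        if samples_over_limit >= SUSTAINED_SAMPLES {
            trigger_graceful_shutdown();
            break;
        }
        thread::sleep(SAMPLE_INTERVAL);
    }
}
```

The point is simply that a node notices sustained overload itself and bows out cleanly, rather than degrading into the misbehaviour that gets it shunned.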

Looking Forward

We’ve learned a lot from this latest test. While the crash was dramatic, it provided us with invaluable insights into how the network behaves under extreme conditions. Overall we’re happy that the adjustments we’re making will help ensure greater stability moving forward, and as always, we appreciate the community’s participation and patience as we continue refining the Autonomi Network.

This is exactly why we test.

General progress

Kicking off with Ethereum Virtual Machine (EVM) integration, Ermine has been focused on enabling EVM wallet creation via the CLI, while @anslelme ran a demonstration for the rest of the team in which everyone was able to add a test network to MetaMask and begin earning tokens in real time. It was successful technically, but the UX needs polish.
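
For the curious, CLI-side wallet generation boils down to creating a fresh keypair and deriving its address. The sketch below uses the ethers-rs crate purely for illustration; it is not the library or code path we are actually shipping.

```rust
// Illustration only, using the ethers-rs crate; not the actual CLI code.
use ethers::signers::{LocalWallet, Signer};
use rand::thread_rng;

fn main() {
    // Generate a brand-new random EVM keypair.
    let wallet = LocalWallet::new(&mut thread_rng());

    // The 20-byte address derived from the public key; in our case this is
    // where node earnings on the L2 would be paid.
    println!("address: {:?}", wallet.address());

    // A real CLI would write the key to an encrypted store rather than print it;
    // shown here only so the sketch is complete.
    println!("private key: 0x{}", hex::encode(wallet.signer().to_bytes()));
}
```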

To this end, @rusty.spork has been putting together some documentation to help people set up MetaMask in preparation for the EVM testnets. We’ll share those shortly.

@mick.vandijke worked on MetaMask integration, the CLI and third party signer integration for the API, and @mazzi has been adding EVM network code to the Launchpad and integrating the Launchpad with the various libraries to make it all play nicely together.

He's also upgraded the Launchpad status screen, adding suggestions such as 'free ports' or 'disable firewall' to the error messages shown when retries occur.

@anslelme has been mostly on the API, adding improvements to Archives, including features required for a WebApp: metadata, file renaming and some usability improvements.
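
As a rough indication of the shape this is heading in (the names and types below are hypothetical, not the final API), an Archive is essentially a map from paths to data addresses plus metadata, with rename as a cheap map operation:

```rust
// Hypothetical sketch of an Archive with metadata and renaming; names and
// types are illustrative only, not the actual API.
use std::collections::BTreeMap;

#[derive(Clone, Debug)]
struct Metadata {
    size: u64,
    created: u64,  // unix timestamp
    modified: u64, // unix timestamp
}

#[derive(Default)]
struct Archive {
    // Maps a path inside the archive to (data address, metadata).
    files: BTreeMap<String, (Vec<u8>, Metadata)>,
}

impl Archive {
    fn add_file(&mut self, path: &str, addr: Vec<u8>, meta: Metadata) {
        self.files.insert(path.to_string(), (addr, meta));
    }

    // Rename in place, keeping the data address and metadata untouched.
    fn rename_file(&mut self, from: &str, to: &str) -> bool {
        match self.files.remove(from) {
            Some(entry) => {
                self.files.insert(to.to_string(), entry);
                true
            }
            None => false,
        }
    }
}

fn main() {
    let mut archive = Archive::default();
    archive.add_file(
        "notes.txt",
        vec![0u8; 32],
        Metadata { size: 12, created: 0, modified: 0 },
    );
    assert!(archive.rename_file("notes.txt", "notes-2024.txt"));
}
```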

Also on the API but in Wasm-world, @bzee has been implementing the file/vault API in Wasm, testing it, and trying to get a headless browser test to work with this setup.
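
For anyone wondering what a headless browser test looks like in this setup, it is roughly the sketch below; `put_file`/`get_file` are hypothetical stand-ins for the file/vault calls, not the real API.

```rust
// Rough shape of a wasm-bindgen-test run in a headless browser
// (e.g. `wasm-pack test --headless --chrome`). The put_file/get_file
// helpers are hypothetical stand-ins, not the real file/vault API.
use wasm_bindgen_test::*;

// Run in a browser environment rather than Node.
wasm_bindgen_test_configure!(run_in_browser);

#[wasm_bindgen_test]
async fn roundtrip_small_file() {
    let data = b"hello autonomi".to_vec();
    let addr = put_file(&data).await;    // hypothetical upload call
    let fetched = get_file(&addr).await; // hypothetical download call
    assert_eq!(fetched, data);
}

// Stand-ins so the sketch is self-contained; the real calls would hit the network.
async fn put_file(data: &[u8]) -> Vec<u8> {
    data.to_vec()
}
async fn get_file(addr: &[u8]) -> Vec<u8> {
    addr.to_vec()
}
```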

@shu and @chriso have been poring over the metrics looking for possible causes for the sudden shunning behaviour we’ve seen. Chris has also been focused on getting the latest release, which will contain Sybil resistance, EVM capabilities and stabilisation fixes, out the door.

These stabilisation fixes come courtesy of @joshuef, who submitted PRs to make nodes quit gracefully should their host machine be running dangerously short of spare CPU, and to remove connections to bad nodes; @roland, who worked on tests for file benchmarking; and @qi_ma, who raised a PR to prevent old data being carried into a new network during a network reset.

49 Likes

Thanks so much to the entire Autonomi team for all of your hard work! :muscle: :muscle::muscle:

Also, good luck to everyone with your Beta Rewards! :four_leaf_clover:

This update was really long; it was like reading “War and Peace”! I see the team is really putting the work in. :joy:

22 Likes

An excellent update - good to see the analysis of the crash, even better to see the planned mitigations.

Excellent decision to delay the new test network until Tuesday. Similar thought should be given to the “launch”.

Sincere thanks to each and every one of the team for getting us to where we are. Now let's think very carefully about the next steps and not spoil 18 yrs of work with an over-hasty rush to launch.

29 Likes

Have to agree with @Southside. Are we really going to “launch” in 2 weeks without a stable testnet running now?

11 Likes

What does this mean?

4 Likes

They were wrong about the 4MB chunk sizes, but don't want to admit it?

Runs away very quickly…

7 Likes

I read this as sticking with 4MB because chunks are always smaller if the uploaded data is smaller than 12MB… while somewhat saying the large chunks are part of the cause of the network collapse…?

4 Likes

I believe this should never have to be part of providing a stable network. Should people want to harm the network, they can simply choose to ignore this, right? Or, better said, intentionally run bad nodes that all drop their connections at the same time. The network should not only survive this, but should do so without any real difficulty?

23 Likes

This update is even slightly shocking. A big round of applause for the team :clap: :clap: :clap: for a sensible approach to problems and tests. I think we are back on track :point_left: :blush:

And great appreciation for your hard work and unrelenting pace :ok_hand:

8 Likes

Personally I think the explanation and proposed mitigations leave a lot to be desired. The cause isn’t really known and the changes don’t seem to address the assumed cause - at least it’s not clear to me how they do.

The cause was either too much CPU squeeze for too many nodes in a short time, or the large chunks, or the former caused by the latter (or mass daft provision by ‘rogue’ community members).

But if this was due to too many connections, which is also highlighted, why not limit that rather than shutting down nodes when CPU use is high?

Is shutting down a struggling node that different to one struggling to handle too many connections?

I admit I’m being an armchair expert here, so maybe wrong, but the explanation doesn’t make sense as it stands, at least to me.

I agree in principle with what @Mightyfool says about education but I think the aim is probably to help run useful tests. Later they can ask the community to act daft again to test the mitigations. Which again speaks to the need for more time - we aren’t at the point where the network is ready to be tested to that level. Far from it from the look of things.

To me, this and the postponed test, which I thought was incredibly optimistic to promise in such a short time while still trying to integrate a large change across many areas (EVM integration), are more evidence that the network is far from ready and that the team feels pressured to deliver in too short a timescale.

Or in other words, still a mess. Where’s that video of changing the wheel on a moving vehicle?

This is a difficult thing to do, to say the least, and it should not be rushed. I feel for the dev team here because they are not the problem. They are doing incredibly well in the circumstances.

Same for the API - there will be consequences down the line for not designing it properly, and not letting developers test it and the network by using it, before both are launched.

21 Likes

Yes, people have every right to plug as much as they want into their sockets and not take the load on their house electricity supply into account. Even to the extent of using a big nail instead of a fuse. If they burn down their flat and the whole building then so be it. It would be entirely wrong to educate them at all.

2 Likes

I read it as there was a problem with resource exhaustion and that would manifest itself no matter the chunk size, but 4 MB chunks made it worse and more visible.

What I see as the biggest problem with the current situation is that the most economical way of running nodes is bad for the network, and it is not even an attack. You just run as many nodes as you can, and when network load rises you start killing them, which creates more load and a cascading effect.
We cannot ask people to “play nice and earn less”; that won't work.

I know node age was discarded because of complexity, but I think we need some kind of “node age light” mechanism to make the network more stable and prevent too much opportunism.
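
Something as small as the sketch below is the kind of thing I mean; the numbers and names are completely made up, it is just to show that “age light” could be a simple weighting rather than full node age:

```rust
// Made-up sketch of a "node age light" weighting: newer nodes carry less
// weight (for trust, or rewards) until they have stuck around for a while.
// All numbers and names are invented for illustration.

/// Hours a node must be online before it gets full weight.
const MATURITY_HOURS: f64 = 72.0;
/// Fraction of full weight a brand-new node starts with.
const STARTING_WEIGHT: f64 = 0.25;

/// Linearly ramp a node's weight from STARTING_WEIGHT up to 1.0 as it ages.
fn age_weight(uptime_hours: f64) -> f64 {
    let progress = (uptime_hours / MATURITY_HOURS).min(1.0);
    STARTING_WEIGHT + (1.0 - STARTING_WEIGHT) * progress
}

fn main() {
    for hours in [0.0, 12.0, 36.0, 72.0, 200.0] {
        println!("uptime {hours:>5.1}h -> weight {:.2}", age_weight(hours));
    }
}
```

It would not stop abuse on its own, but a wave of brand-new nodes could not immediately dominate, and crashing them would cost the time already invested.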

Different situation. Currently there is no penalty for running an oversubscribed server and crashing it at the worst moment.

7 Likes

I fully agree, and I was actually just thinking about this myself, but I consider myself not knowledgeable enough to speak on this front. However, from a purely non-technical standpoint I can see the benefits: trusting nodes that have been with us longer would instantly mitigate the risk of people scaling up the network fast only to crash it shortly after. When nodes have to gain trust, there is a cost involved in getting them trusted enough that their falling away could negatively impact the network. Sounds to me as if that could be seen as a PoS sort of safeguard mechanism?

4 Likes

We’ve never had huge cascading outages with small nodes and small chunks and now we’re discussing adding additional complexity with node age to get those “unavoidable situations” under control…?

9 Likes

That was my whole point! There's no disincentive against under-provisioning CPU, RAM or disk, and @Mightyfool is saying the network should just cope (which ideally it should), so there is no need to educate users about what is sensible to keep their nodes running without crashing the network. I think a bit of education would be a good idea.

3 Likes

… I guess it would be enough to make sure that people who have reserves and whose nodes respond fast are paid better than nodes that respond with a few (milli)seconds of delay… so paying better the ones that help keep the network strong…

Maybe by not paying the one with the lowest quote… but paying the fastest… possibly at the price of the lowest quote (?) @joshuef @dirvine
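
Just to make it concrete (a completely made-up sketch, not anything the team has proposed): pick the fastest responder among the candidates, but pay it the cheapest quoted price, so there is no reward for winning purely by undercutting.

```rust
// Made-up sketch of the idea above: the fastest-responding node wins the
// upload, but is paid the lowest price anyone quoted. Not a real Autonomi
// mechanism, just an illustration.

#[derive(Debug, Clone)]
struct Quote {
    node_id: String,
    price: u64,            // quoted price in the smallest token unit
    response_time_ms: u64, // how quickly the quote came back
}

/// Fastest responder wins; the cheapest quote sets the price actually paid.
fn pick_winner(quotes: &[Quote]) -> Option<(Quote, u64)> {
    let fastest = quotes.iter().min_by_key(|q| q.response_time_ms)?.clone();
    let lowest_price = quotes.iter().map(|q| q.price).min()?;
    Some((fastest, lowest_price))
}

fn main() {
    let quotes = vec![
        Quote { node_id: "node-a".into(), price: 120, response_time_ms: 35 },
        Quote { node_id: "node-b".into(), price: 90, response_time_ms: 210 },
        Quote { node_id: "node-c".into(), price: 150, response_time_ms: 20 },
    ];
    if let Some((winner, pay)) = pick_winner(&quotes) {
        println!("{} wins and is paid {}", winner.node_id, pay); // node-c, paid 90
    }
}
```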

1 Like

Do the larger chunks have anything to do with the introduction of the ERC-20? Do they have to be bigger to accommodate it, I mean?

3 Likes

I like this idea - the original design was what got me interested in the project to start with.

I agree with this: any changes to a node can and will be coded out by a malicious or opportunistic actor. Adding CPU limits to shut down a node? That's a line of code to edit and remove, so it's pointless wasting dev time on it. These sorts of safeguards, even the shunning, need to be done by a node's peers.

The network needs to be able to cope with these situations of over-provisioning, unintentional or otherwise; asking users to handle it just sets another barrier to adoption.

Sounds like introducing validator/auditor roles with consensus into the network. Not a bad idea: kind of like a P2P blacklist of IPs that the network won't talk to, but verified by nodes… maybe like the shunning process, but distributed. So when a node thinks a peer is bad, instead of just shunning it locally, it asks its peers to check the suspect node as well, and the consensus response causes a bad node to be removed much more quickly by both the shunning node and all its peers.
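
Roughly what I'm picturing, as an invented sketch (nothing to do with the actual shunning code, and `ask_peer_verdict` is a hypothetical network call): before shunning, ask a handful of close peers whether they also see the suspect as bad, and only act on a majority.

```rust
// Invented sketch of distributed shunning: instead of acting on local
// suspicion alone, a node asks K close peers for their verdict and only
// shuns on a strict majority. `ask_peer_verdict` is a hypothetical call.

/// Hypothetical: ask one peer whether it also considers `suspect` bad.
fn ask_peer_verdict(_peer: &str, _suspect: &str) -> bool {
    // In a real protocol this would be a signed request/response over the network.
    true
}

/// Shun the suspect only if a strict majority of the consulted peers agree.
fn should_shun(suspect: &str, peers_to_ask: &[&str]) -> bool {
    let mut agreeing: usize = 0;
    for peer in peers_to_ask {
        if ask_peer_verdict(peer, suspect) {
            agreeing += 1;
        }
    }
    agreeing * 2 > peers_to_ask.len()
}

fn main() {
    let close_peers = ["peer-1", "peer-2", "peer-3", "peer-4", "peer-5"];
    if should_shun("suspect-node", &close_peers) {
        println!("majority agrees: shun suspect-node and share the verdict");
    }
}
```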

1 Like

Don’t think so - the large chunks are just to enable home runners to provide more storage while not needing to upgrade to a special router…

2 Likes