Let’s start with a non-reveal of one of the worst kept secrets in history (thanks to you pesky GitHub sifters). The L2 that we are using as the basis for our network token is … wait for it … Arbitrum.
Why Arbitrum? Well, the first thing to say is that the choice is not set in stone; it can be changed. But it’s the one we have decided to go with because it has a solid reputation, great adoption and among the lowest transaction fees. It also has the highest market share of any L2, and funds can be sent directly from the likes of Robinhood or Coinbase with no cross-chain bridge required, which is important for folks in the US and elsewhere with restrictive regulations. It’s also supported by all respectable CEXs around the world. All in all, it ticks more boxes than any of the alternatives.
Another badly kept secret is that there will be no testnet today as had been planned. Unfortunately, it proved impossible to get the stability improvements, the EVM integration, the launchpad changes and the necessary testing all in place in the short time available, so rather than rush something out that would quite likely fall on its face in short order, we’ve decided to delay until Tuesday. Sorry to disappoint. But we will also be sharing a lot more information with you on that day, so please mark Tuesday in your calendars as ‘Big Day’.
As you know, the last testnet took a nosedive after a mass shunning event. Once again it recovered, but this is obviously not ideal behaviour and we need to sort it out. Here’s what happened.
Autonomi Network Report: Another Test, New Insights
As we entered the latest round of testing for the Autonomi Network, things started off with a steady and promising growth trajectory. However, the network then encountered a sudden and dramatic series of events. Let’s walk through what unfolded, what we’ve learned from this latest test, and why, ultimately, it will lead to a more resilient system and a better launch.
Period 1: Steady Growth and Rising Activity (2024-10-10 to 2024-10-12)
The test began smoothly on October 10, with the network size growing steadily throughout the day. By the evening we were seeing consistent growth, starting from just under 29,000 nodes and climbing past 100,000 by the night of October 11. During this period, the average number of connected peers per node ranged between 700 and 900, with chunk upload rates progressively increasing as uploads flowed steadily through the system. By October 12, things were still looking solid. The network size held up around 112,000 nodes, and we were able to hold upload rates at a robust 5,800 chunks per hour. But as we would soon learn, this period of steady growth masked an underlying strain that would eventually lead to a rapid decline.
Period 2: The Network Begins to Strain (2024-10-12, 15:00 - 16:30 UTC)
At 15:00 UTC on October 12, the network still appeared healthy. Uploaders had been pushing at a high but controlled rate, with no further increase. However, subtle signs started to emerge that all was not well. While over 1,400 chunks were still being uploaded every 15 minutes, the network size, which had been holding at around 111,000 nodes, began a slow but noticeable decline. By 16:00 UTC we saw the first clear indications that something was off: the chunk upload rate was still high, but the number of connected peers per node was declining. This suggested that some nodes were struggling to maintain connections, and resources were starting to stretch thin.
Period 3: The Free Fall (2024-10-12, 16:30 - 17:15 UTC)
The situation deteriorated between 16:30 and 17:00 UTC. The network size dropped from 107,000 nodes to just 24,513 in less than 30 minutes. This was a sharp and sudden contraction, and the average number of connected peers per node skyrocketed—hitting a peak of 2,375 by 17:00 UTC. This indicated that many nodes were struggling to manage too many connections at once. It became clear that the network had been overwhelmed by the rapid growth and the load imposed on individual nodes, leading to widespread failures and a collapse.
Period 4: Recovery and Rebounding (2024-10-13)
After the crash, the network began a slow recovery. By the early hours of October 13, the network size stood at 9,500 nodes, and chunk uploads had resumed, though at a far slower pace than before. Throughout the day the network steadily rebounded, reaching over 65,000 nodes by 12:15 UTC, though it never fully recovered to its pre-crash peak. Interestingly, the rebound followed a highly linear growth pattern, almost as if the network were rebuilding itself in a controlled manner. However, the damage from the initial crash was significant, and while the network was regaining size, the pace and stability were far from the levels we had seen earlier.
What Caused the Crash?
Looking at the data, we believe the crash may have been caused by a group of users who provisioned a large number of nodes with insufficient resources to handle the load, particularly CPU. These nodes may have struggled to keep up as the network size grew and the chunk upload rate surged, causing a domino effect that led to widespread node failures.
At the same time, the network’s reliance on a growing number of connected peers per node put additional strain on the system. At its peak, some nodes had over 2,400 connected peers, far exceeding what was manageable with their available CPU and memory.
Given the data collected from our nodes and the corroborating logs from the community, we cannot confirm or rule out the 4MB chunk size as the leading cause of this crash. It is, however, clear that the rapid increase in resource consumption, particularly CPU and memory, was a major contributor to the network’s failure.
Next Steps: Improvements and Precautions
From this test, we’ve identified several key improvements that we’ll be implementing in future:
1. Managing Resource Limits:
We’re ensuring that nodes shut down gracefully if their CPU usage exceeds a set limit for a prolonged period. This will prevent nodes from becoming overloaded and misbehaving, reducing the risk of network-wide issues. (A minimal sketch of the idea follows this list.)
2. Closing Bad Connections:
We’ll be closing connections from “shunned” nodes more aggressively to reduce the number of open connections and prevent resource drain. (A second sketch after this list illustrates the idea.)
3. Educating Node Operators:
We’ll work to better educate the community about not running more nodes than their hardware can support. Ensuring that each node has appropriate resources (CPU, memory, disk) is crucial to network stability.
4. Chunk Size Flexibility:
We’ll continue to allow the network to adjust chunk sizes, up to 4MB, when uploading to ensure flexibility in managing load and avoiding resource exhaustion.
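To make point 1 concrete, here is a minimal sketch, not the team’s actual implementation, of how a node could watch its own CPU usage and exit gracefully after a sustained breach. It assumes the `sysinfo` crate (exact method names vary between sysinfo releases), and the threshold, sample interval and strike count are illustrative values only:

```rust
use std::time::Duration;
use sysinfo::System;

// Illustrative values only; the real limits are a tuning decision.
const CPU_LIMIT_PERCENT: f32 = 95.0;
const SAMPLE_INTERVAL: Duration = Duration::from_secs(10);
const STRIKES_BEFORE_SHUTDOWN: u32 = 18; // roughly three minutes above the limit

fn main() {
    let mut sys = System::new_all();
    let mut strikes = 0u32;

    loop {
        // sysinfo needs two refreshes separated by a pause to compute CPU usage.
        sys.refresh_cpu_usage();
        std::thread::sleep(SAMPLE_INTERVAL);
        sys.refresh_cpu_usage();

        // Method name from recent sysinfo releases; older versions expose the
        // same figure via global_cpu_info().cpu_usage().
        let usage = sys.global_cpu_usage();

        if usage > CPU_LIMIT_PERCENT {
            strikes += 1;
        } else {
            strikes = 0;
        }

        if strikes >= STRIKES_BEFORE_SHUTDOWN {
            eprintln!(
                "CPU usage above {}% for a prolonged period; shutting down gracefully",
                CPU_LIMIT_PERCENT
            );
            // A real node would flush state, close connections cleanly and exit here.
            break;
        }
    }
}
```

Shutting down cleanly is preferable to limping on: an overloaded node that stays online but responds slowly is exactly the kind of neighbour that ends up shunned, which is what triggered the cascade described above.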
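And here is a similarly hedged, library-agnostic sketch of point 2. The types are hypothetical stand-ins rather than Autonomi’s real ones; the point is simply that a shunned peer’s connection is torn down immediately instead of being left open to drain file descriptors and bandwidth:

```rust
use std::collections::{HashMap, HashSet};

// Placeholder for the real peer identifier type.
type PeerId = String;

/// Anything that represents an open connection we can close (hypothetical trait).
trait Connection {
    fn close(&mut self);
}

struct PeerManager<C: Connection> {
    connections: HashMap<PeerId, C>,
    shunned: HashSet<PeerId>,
}

impl<C: Connection> PeerManager<C> {
    /// Mark a peer as shunned and immediately tear down its connection,
    /// rather than leaving it open until it times out.
    fn shun(&mut self, peer: &PeerId) {
        self.shunned.insert(peer.clone());
        if let Some(mut conn) = self.connections.remove(peer) {
            conn.close();
        }
    }

    /// Refuse connections from peers that are already shunned.
    fn on_incoming(&mut self, peer: PeerId, mut conn: C) {
        if self.shunned.contains(&peer) {
            conn.close();
        } else {
            self.connections.insert(peer, conn);
        }
    }
}

// A dummy connection type so the sketch runs end to end.
struct DummyConn(PeerId);
impl Connection for DummyConn {
    fn close(&mut self) {
        println!("closed connection to {}", self.0);
    }
}

fn main() {
    let mut mgr = PeerManager {
        connections: HashMap::new(),
        shunned: HashSet::new(),
    };
    mgr.on_incoming("peer-a".into(), DummyConn("peer-a".into()));
    mgr.shun(&"peer-a".into()); // closes the live connection straight away
    mgr.on_incoming("peer-a".into(), DummyConn("peer-a".into())); // refused and closed
}
```

In the real node this would sit at the networking layer, but the principle matches the PRs mentioned in the progress section below: fewer open connections to shunned or bad nodes means less resource drain on already-stretched machines.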
Looking Forward
We’ve learned a lot from this latest test. While the crash was dramatic, it provided us with invaluable insights into how the network behaves under extreme conditions. Overall we’re happy that the adjustments we’re making will help ensure greater stability moving forward, and as always, we appreciate the community’s participation and patience as we continue refining the Autonomi Network.
This is exactly why we test.
General progress
Kicking off with Ethereum Virtual Machine (EVM) integration, Ermine has been focused on enabling EVM wallet creation via the CLI, while @anslelme ran a demonstration for the rest of the team in which everyone was able to add a test network to MetaMask and begin earning tokens in real time. It was a technical success, but the UX needs polish.
To this end, @rusty.spork has been putting together some documentation to help people set up MetaMask in preparation for the EVM testnets. We’ll share those shortly.
@mick.vandijke worked on MetaMask integration, the CLI and third-party signer integration for the API, and @mazzi has been adding EVM network code to the Launchpad and integrating the Launchpad with the various libraries to make it all play nicely together. He’s also upgraded the Launchpad status screen, adding suggestions like ‘free ports’ or ‘disable firewall’ to the error messages regarding retries.
@anslelme has been mostly on the API, adding improvements to Archives, including features required for a WebApp: metadata, file renaming and some usability tweaks.
Also on the API but in Wasm-world, @bzee has been implementing the file/vault API in Wasm, testing it, and trying to get a headless browser test to work with this setup.
@shu and @chriso have been poring over the metrics looking for possible causes for the sudden shunning behaviour we’ve seen. Chris has also been focused on getting the latest release, which will contain Sybil resistance, EVM capabilities and stabilisation fixes, out the door.
These stabilisation fixes come courtesy of @joshuef, who submitted PRs to make nodes quit gracefully should their host machine be running dangerously low on spare CPU, and to remove connections to bad nodes; @roland, who worked on tests for file benchmarking; and @qi_ma, who raised a PR to prevent old data from being brought into a new network during a network reset.