Update 17th October, 2024

I saw much higher CPU usage with 4MB chunks, to the point where I had to increase the number of nodes slowly, as adding them too quickly would max out my CPUs.

What happens when the nodes require heavy service? They max out the CPU again.

It’s hard to understand why there is no acknowledgment that the 4m chunks have caused serious issues.

3 Likes

Education is nice, but when money is in the game people will do whatever brings them the most.

As I understand it, there was also never as much uploading as in this testnet.

I like this idea!

4 Likes

The point was about education, not code to enforce it. Do we want to say to users, ‘bear in mind CPU and RAM usage can fluctuate or even double’? Or just say ‘run nodes’, and have them cram as many onto their machines as possible and increase instability?

1 Like

No, that’s the number of chunks you are thinking of.

1 Like

Larger chunks make uploads more efficient, and they also help lower the transaction costs of using the ERC20 L2.

But they likely cause data transfer issues that @neo has written about in some depth. They might be a contributing factor in CPU usage, as retransmissions are required.

Those potential issues aren’t really being discussed!

7 Likes

I think it doesn’t matter what we tell users - they will make their own decision… When crypto / $$ are involved, the overwhelming choice historically is to run more nodes = more $$, at the expense of network reliability, ultimately killing the very thing we want to succeed for short-term gains.

It’s impossible to tell users what hardware they need to run - the uploads and downloads are not under their control. Imagine saying you need 1GB of memory and 10Mbps of internet per node, only to find that people start downloading heavily from the network - the 10Mbps is maxed out, and the CPU is flatlined in disk I/O waits while hundreds of different chunks are read randomly from disk - we then start shunning nodes, and the network collapses.

No, the way this will survive is if the network monitors for bad nodes and just puts them in containment - shunning has always been a bad design, and it’s proved itself with the failure we have seen. It makes no sense to kick a node out of the network, one which “might” have good data on it, just because it’s slow for a few minutes. Containment of the node, auditing, validation and blacklisting of the bad offenders will provide a more robust network.

I would refer back to the RFCs but they have been deleted from GitHub? Where did the designs go, or am I looking in the wrong place?

Anyhow, this is immaterial - all it shows is that the design (not that I can find it now) is a long way from being release ready. I really hope someone makes the call to push the schedule to the right a few months to give the devs some time to breathe, and us more time to test these things.

8 Likes

Despite some of what I see as negativity, and others see as constructive criticism, I can’t wait to see this thing in the wild. Only then will we all really know if it’s real or not.

I just want to upload my photos. Hurry up :pray::pray:

7 Likes

Who writes this stuff?

Really? What dramatic change was made recently that many node operators have reported being a strain on their setups? Maybe going from 500 KB to 4 MB chunk sizes?

You can’t possibly be serious. When you make a mistake, you admit it, fix it and move on. What kind of drivel is this, this late in the game? You had a stable network, then you broke it, and you can’t find your way back to that stable network by simply going back to a 500 KB chunk? You can make massive changes like moving to ERC20 mid-“beta”, but you can’t simply go back to a 500 KB chunk size for now to restore network stability?

Deadlines are completely fine when the team doesn’t self-sabotage or wet the freaking bed. These are rookie mistakes. Grow up.

3 Likes

Or possibly a slightly different algorithm… paying the fastest node if it’s closer in XOR space… So pretty hard to game/fake, and at the same time a slowish home node won’t lose out completely… while the well-provisioned node will earn significantly better than the slow-responding, overloaded node…

3 Likes

Better keep your own local backup if you need it that quickly.

3 Likes

Any decent provisioning software checks resources first; it’s ITAM 101: inventory the resources/assets, then have the provisioning app - in this case node-launchpad - use LUTs (to control GIGO and the node-collective’s performance in the worst use case imagined), effectively creating a sandbox of node-launchpad UI choices for the node operator:

- number of nodes (given the CPU, RAM, disk space and bandwidth found), which sets a sliding scale of 20-30% unassigned disk space, and
- a sliding scale for the total number of connections that can be supported among the nodes,

so it can effectively handle the peak-load use case when all nodes are fully engaged doing everything, all at once…

The individual configuration categories of the provisioning software are interdependent.

That said, this recent test wave helped the dev team figure out how to do the above, so the network adjusts accordingly, because the collective system of nodes making up the Autonomi Network never gets over-provisioned in the first place.
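To make the idea concrete, here is a rough sketch of that kind of cap-by-scarcest-resource check. All of the names, per-node requirements and the 25% disk reserve are made up for illustration - none of this is taken from node-launchpad itself.

```rust
/// Hypothetical inventory of the machine, gathered before any nodes are started.
struct Resources {
    cpu_cores: u32,
    ram_gb: u32,
    free_disk_gb: u64,
    uplink_mbps: u32,
}

// Illustrative per-node requirements and reserve margin; real values would
// have to come from measurement, not from this sketch.
const RAM_PER_NODE_GB: u32 = 1;
const DISK_PER_NODE_GB: u64 = 35;
const MBPS_PER_NODE: u32 = 5;
const NODES_PER_CORE: u32 = 2;
const DISK_RESERVE: f64 = 0.25; // keep 20-30% of disk unassigned

/// Cap the node count by whichever resource runs out first at peak load.
fn max_nodes(r: &Resources) -> u64 {
    let usable_disk = (r.free_disk_gb as f64 * (1.0 - DISK_RESERVE)) as u64;
    let by_cpu = (r.cpu_cores * NODES_PER_CORE) as u64;
    let by_ram = (r.ram_gb / RAM_PER_NODE_GB) as u64;
    let by_disk = usable_disk / DISK_PER_NODE_GB;
    let by_net = (r.uplink_mbps / MBPS_PER_NODE) as u64;
    by_cpu.min(by_ram).min(by_disk).min(by_net)
}

fn main() {
    let r = Resources { cpu_cores: 8, ram_gb: 16, free_disk_gb: 1000, uplink_mbps: 50 };
    println!("suggested node cap: {}", max_nodes(&r));
}
```

With those example numbers the uplink is the limiting factor, which is exactly the point: the operator never gets offered more nodes than the weakest resource can carry at peak load.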

8 Likes

Let’s wait for a second here.

It’s -suspected- that it was caused by node operators over-provisioning nodes, but we don’t know anything for sure.

This could just as well have been caused by a slightly failed failover of a core router at Hetzner, or a housekeeping script gone wrong that affected VPS users, or the cleaning lady pulling the fibre out of the ADVA.

4 Likes

:sweat_smile: :tada: :dart:

I think you got that right

2 Likes

Weird stuff -does- happen in the datacenter :slight_smile:

9 Likes

Ok folks, own up, who has been shouting at their nodes?!

Maybe try talking nice next time? :thinking:

:rofl:

15 Likes

I get a feeling that larger chunks might have something to do with the performance of the ERC20 L2 - the throughput. Wanting larger chunks for better throughput, because the ERC20 L2 might suffer from limitations.

1 Like

Man things have sure changed around here!

I have been here for some years and had gotten used to the “get it right before launching” attitude - even though, like everyone else here, I thought the SN was a great idea and was itching - even desperate (for our own projects) - to see it go live…

Over those years I have periodically done testing when I have had time, and when I have done that, I have noticed a steady progression towards the end objective…

So I am sort of gobsmacked to see the dramatic change in direction - as well as being quite nervous about the change! What if there is a major failure after launch? That could be a disaster for the whole project.

8 Likes

I apologise for the response here, as I know there is so much pressure on all you wonderful people doing the development work, and I have praise for the efforts and the quality of the work you do. In the great scheme of things Autonomi, this is a minor negative response in light of all the positives. I really cannot overstate the admiration I have for the work you all do.

This is mentioned because it is important, and it is potentially data that will never show up in your collected metrics, since you can only gather it by asking the community to give it to you. Also, I do have some decades of experience working with comms, and some things are so obvious to me - fundamental, if you will.

“By October 12, things were still looking solid.” Not so solid, due to the rate of shunning being high - unlike anything we saw in all the networks prior to the previous 32/4 network.

Did you not see the very large shunning rate happening during this period, which set the stage for what followed?

The high shun rate experienced throughout the test network up to now meant the network was not stable enough to endure a 10% loss of nodes. This one event caused the unstable network with reduced connectivity (due to vast amounts of shunning) to collapse until it ended up with the more stable (connectivity-wise) nodes. The least-shunned nodes, perhaps.

Due in no small part to the participants being eager beavers who, with kid gloves, coaxed their nodes to rejoin the network. I doubt that in a public setting the general public would have been able to pull this recovery off with as much success, and a worldwide network would have ended up more segmented.

The “controlled manner” was the effort of the participants, including those with tens of thousands of nodes under their control. You would not see this scale of effort (50-75% of participants micromanaging nodes back to life) in the real world.

I extremely doubt this, because the nodes at that stage were barely at 2GB (most at 1GB), so in fact there was no real scope for under-provisioned resources, and the nodes were much like those in the previous networks without the 4MB max chunk size.

Would this not be due to the large shunning rate that had been occurring throughout the whole test, and nodes scrambling to look for new good peers?

I hope you are not just taking the raw CPU% usage data. It is common practice for those using large machines to run some intensive tasks under “nice” to reduce those tasks’ hit on other tasks. In other words, I could run nodes plus an intensive stats task and have the CPU at 100% all the time, and yet the nodes would see no impact from this.

Far better to use internal metrics in the node s/w that measure effort vs time: e.g. every major task is given a value for effort, and the time taken to perform it feeds into a global value in the node; using that, one can determine the performance of the node. Then all you need to do is determine the lowest acceptable value for this global performance figure, and that gives you a true measure rather than an artificial measure of CPU load.

As a result of this effort measuring you also end up capturing the effects of swapping, disk access, and other non-CPU factors that affect the performance of the node. A far better measure of the overall performance of the node - all for adding time measurement of major tasks and assigning a relative “effort” value.
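As a rough sketch of what I mean (the task names, effort weights, smoothing factor and threshold here are all invented for illustration, nothing to do with the actual node code):

```rust
use std::time::{Duration, Instant};

/// Illustrative "effort" weights per major task; real values would be tuned.
#[derive(Clone, Copy)]
enum Task {
    StoreChunk, // disk write + hashing
    ServeChunk, // disk read + network send
    Replicate,  // bulk transfer
}

impl Task {
    fn effort(self) -> f64 {
        match self {
            Task::StoreChunk => 4.0,
            Task::ServeChunk => 2.0,
            Task::Replicate => 8.0,
        }
    }
}

/// Rolling effort-per-second score: each measured task contributes
/// effort/elapsed, smoothed into one global value per node.
struct PerfScore {
    ema: f64,
    alpha: f64,
}

impl PerfScore {
    fn new() -> Self {
        Self { ema: 0.0, alpha: 0.1 }
    }

    fn record(&mut self, task: Task, elapsed: Duration) {
        let rate = task.effort() / elapsed.as_secs_f64().max(1e-6);
        self.ema = self.alpha * rate + (1.0 - self.alpha) * self.ema;
    }

    /// Compare against a minimum acceptable rate instead of raw CPU%.
    fn is_healthy(&self, min_rate: f64) -> bool {
        self.ema >= min_rate
    }
}

fn main() {
    let mut score = PerfScore::new();
    let start = Instant::now();
    std::thread::sleep(Duration::from_millis(20)); // stand-in for real work
    score.record(Task::ServeChunk, start.elapsed());
    println!("healthy: {}", score.is_healthy(10.0));
}
```

Because the clock keeps running while the node is swapping or waiting on disk, slowness from any cause drags the score down, which is exactly what raw CPU% misses.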

You can also add in the disk space allocation scheme I suggested, to help those people without the knowledge not to over-provision nodes on their machine (i.e. under-provision the disk resource). It’s a simple and effective method. Yes, those with knowledge can bypass it, but one would hope they have the knowledge to close nodes as space usage increases. It is just one small way to ensure the vast majority do not run out of disk space.

And this will continue to have instability due to all the issues that come with using real-life communication networks. I am serious here: 1/2 MB worked fine, but 4MB increases the error rate, for one (at least 64 times the problems) - too far for stability in the network design. We saw that in the shun rate being very high (shunning of nodes and nodes shunning others), sometimes 2 or more shuns per hour. This is in line with the comms error rates one expects when using such huge data blocks over internet comms.

HINT: why do you think packet sizes on the internet remain for the most part at 1500 bytes? Better error recovery, fewer delays, better performance. Sending data blocks over the internet also performs better with smaller sizes: more error checking, quick error recovery and so on. This is why increasing the data block size increases problems by more than the square of the size increase. Especially with UDP.
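A back-of-envelope way to see the trend, assuming independent per-packet loss (a simplification) and a per-packet loss rate I have simply made up for illustration:

```rust
/// Probability that at least one packet of a chunk is lost, assuming
/// independent per-packet loss - a simplification, but it shows the trend.
fn p_chunk_hit(chunk_bytes: f64, mtu_payload: f64, p_loss: f64) -> f64 {
    let packets = (chunk_bytes / mtu_payload).ceil();
    1.0 - (1.0 - p_loss).powf(packets)
}

fn main() {
    let p = 0.001;    // 0.1% per-packet loss, purely illustrative
    let mtu = 1452.0; // rough payload per 1500-byte packet after headers
    println!("0.5 MB chunk: {:.1}%", 100.0 * p_chunk_hit(512_000.0, mtu, p));
    println!("4 MB chunk:   {:.1}%", 100.0 * p_chunk_hit(4_000_000.0, mtu, p));
}
```

With those made-up numbers, a 4MB chunk spans roughly eight times as many packets as a 0.5MB one, so it is far more likely that at least one packet is lost and the whole transfer is delayed or retried.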

You are in the realm of expecting issues due to the max record size being 4MB - have fun keeping it at that.

HINT: this is the cause of some resource limits being hit (bandwidth etc.)

I predict the network will work with a max record size of 4MB if people are careful and handle it with “kid gloves”, but once the less knowledgeable public start using it then expect the garden hose effect to continue.

Unfortunately this analysis, as good as it is and for all the effort that went into it, has missed some very important data points that were not gathered by metrics, and it reflects a lack of long-term comms experience. The attempt to maximise data flow has fallen into the trap that many programmers have fallen into over the decades when sending data over comms.

Please oh please provide very detailed documentation on how to set this up for us to use. There are a number of us testers who have NEVER used ETH or ERC20 or an L2 and are clueless at this time, and the docs out there range from garbage to OK - too long to sort through to get up and running in a short time period.

18 Likes

Good question. They do seem hell-bent on 4MB, and I suspect they wanted even larger, but even in their optimised local DO network they saw greater sizes causing issues.

When you asked this I realised it is most likely to do with trying to reduce the number of transactions needed for large files. Imagine uploading a 100GB video file with 1/2MB chunks: that is 200,000 chunks and approx 780 smart contracts to execute (256 chunks per smart contract). At a 4MB max chunk size it is 25,000 chunks and fewer than 100 smart contracts to execute.
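Spelling the arithmetic out (the 256-chunks-per-payment figure is the one quoted above, not something I have verified independently):

```rust
/// Number of chunks and payment transactions for a file, given a chunk size
/// and how many chunks can be bundled into one payment.
fn payments_needed(file_bytes: u64, chunk_bytes: u64, chunks_per_payment: u64) -> (u64, u64) {
    let chunks = file_bytes.div_ceil(chunk_bytes);
    let payments = chunks.div_ceil(chunks_per_payment);
    (chunks, payments)
}

fn main() {
    let file = 100 * 1_000_000_000u64; // 100 GB
    // 0.5 MB chunks: 200,000 chunks, ~782 payment transactions
    println!("{:?}", payments_needed(file, 500_000, 256));
    // 4 MB chunks: 25,000 chunks, ~98 payment transactions
    println!("{:?}", payments_needed(file, 4_000_000, 256));
}
```

So the 8x larger chunk size cuts both the chunk count and the on-chain transaction count by roughly 8x, which is presumably the attraction.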

Nope, that is controlled by node size, not chunk size.

Not sure what you are getting at, since shunning is blacklisting the node and containing it outside the network. Also, the good data is not an issue, since there are so many copies kept and replication ensures the data on a bad node is replicated.

Remember the network runs with each node as the sole arbiter of its own destiny and of how it views other nodes. If a node follows the simple rules then other nodes will be happy to talk to it. If not, it is shunned and blacklisted.

Getting shunned takes more than being slow for a few minutes. There is a process in the other nodes before a node gets blacklisted. At this time, from memory, it’s 3 strikes over a period of time, and the strikes fall away if the node is not blacklisted.
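Roughly the kind of bookkeeping I mean, sketched from memory - the strike limit and window here are placeholders, not the actual node code:

```rust
use std::time::{Duration, Instant};

// Placeholder values - the real node code may use different numbers.
const STRIKE_LIMIT: usize = 3;
const STRIKE_WINDOW: Duration = Duration::from_secs(60 * 60);

/// Per-peer record of recent offences; old strikes age out of the window.
struct PeerRecord {
    strikes: Vec<Instant>,
}

impl PeerRecord {
    fn new() -> Self {
        Self { strikes: Vec::new() }
    }

    fn add_strike(&mut self) {
        self.strikes.push(Instant::now());
    }

    /// Shun only once enough strikes have accumulated within the window.
    fn should_shun(&mut self) -> bool {
        let now = Instant::now();
        self.strikes.retain(|t| now.duration_since(*t) < STRIKE_WINDOW);
        self.strikes.len() >= STRIKE_LIMIT
    }
}

fn main() {
    let mut peer = PeerRecord::new();
    peer.add_strike();
    peer.add_strike();
    println!("shun after 2 strikes? {}", peer.should_shun()); // false
    peer.add_strike();
    println!("shun after 3 strikes? {}", peer.should_shun()); // true
}
```

The point being: a single slow response adds a strike at most, and strikes that aren't followed up within the window simply expire.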

So far I only see people trying to give constructive criticism, as I also try to do.

Agreed, the big change and its effects happened, but let’s keep to the big change and hope the more minor tweaks work. I have worked with comms and know this change was adventurous; the multiplying of issues that came with it took the network to the edge of the cliff, and like in the Road Runner cartoons the cliff edge gave way from under the coyote.

And it favours the data centre, until all the nodes except a small percentage end up in optimised, networked data centres. Also, the fastest response is determined more by location, so the countries with the largest number of uploaders will end up with their data centres being populated with nodes.

Speed is not the issue, and you are also only looking at the first packet. Speed of response is not necessarily a measure of throughput or of actual CPU usage. For instance, my RPi may respond fast because it has a few nodes and little uploading being done (large network). So it responds super fast, but due to its 10Mb/s uplink it takes a long time to send the packet - overall, very slow. But the large machine on a 5Gb/s link with a hundred nodes responds slightly slower, yet will get you the data faster. Which is the “faster” node?

Better to keep with the client choosing the cheapest node and let the woefully slow nodes be shunned out of the network. It is cheaper in complexity and in auditing of the code, and it doesn’t encourage mods to the code that speed up the initial response at the expense of overall efficiency.

Yeah, that didn’t happen any more than for the 2GB / 1/2MB networks. The number of nodes was lower, and the disk space used by any node was not much more than in those previous beta networks that ran for weeks, so I can be certain it was not due to provisioning issues.

This time it was one person with 10,000 nodes restarting them. 10% of the network went down over a number of minutes. That person owned up to restarting the nodes at the time of the collapse.

And yes, if the network is not stable enough for that then it would not survive a major power outage like the one in the past that took out much of the power grid in the US - a faulty relay, if I recall correctly, that caused a cascade effect blacking out so many of the US’s large population areas.

12 Likes

Seems like we had a leak of connections to bad nodes, which exacerbates any collapse.

But CPU topping out has also routinely shown us in testnets that nodes are more likely to be shunned (more connection issues), so preventing that situation will hopefully prevent shunning and the knock-on effects there.

It depends on whether they’re over-provisioned or not. While we don’t have a smart enough launchpad to automatically raise/lower node counts, this is a dumb-simple way to prevent that circumstance for now.

10 Likes