How to keep your nodes healthy and productive

In my experience, UDP packet loss (between, say, my cloud VM and my home server) can be quite bad and lead to significantly reduced earnings, all without the node ever being shunned. Further, UDP packet loss also depends on the ISP's setup, no matter how good your own router is.

Has anyone done packet loss testing? What is typical UDP packet loss between home nodes? 1%, 10%? Things are clearly better in a datacenter than at home.

I considered monitoring packet loss to optimize the number of nodes per IP, but even just measuring the loss was tricky. I hope that ant nodes will eventually be able to measure the quality of their connectivity to the network using metrics such as the Autonomi message loss rate.
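
In the meantime, a rough measurement is possible with iperf3 between the two machines; a minimal sketch (the hostname and bitrate are placeholders, pick whatever matches your link):

  # on the receiving machine, e.g. the home server
  iperf3 -s

  # on the sending machine, e.g. the cloud VM: 30 seconds of UDP at 5 Mbit/s
  iperf3 -c home.example.net -u -b 5M -t 30

The client's summary line reports jitter and "Lost/Total Datagrams", which gives the loss percentage in that direction.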


Yes.

Formicaio gives node-specific reports of how many other nodes have shunned each node. I wonder if these low-earning nodes have picked up some shuns from other nodes.

I don't have any nodes right now, but I used to run 6-8 nodes for almost two weeks behind a crap(ish) router, and didn't get any shuns. Oh, but I didn't get any records either, so of course there were no shuns… :man_facepalming: When the network is empty, there is apparently little chance of knowing how well the connections work, because you could be out of reach and still not get shunned.

There would definitely have been something else: the quality of the cables, runs too long for the category being used, or faulty cables.

The distance of a LAN cable is so small compared to the distance to the ISP that 10 meters versus 20 meters makes zero difference.

But with low-quality cables, 10m versus 20m will make a big difference due to errors. That's why those using shorter cables often do better: because of the cables themselves, not the lag due to distance. Nanoseconds of extra lag in your home will have no, zero, material effect on the network.

Also, the network is not supposed to be fazed because a datacentre saves 10 or 20 ms of lag time.

In my home I have quality Cat 7 cables, even though it's only 1 Gbps being used for the most part, and they run 20 meters from the router: router through a patch panel, then a 20m cable to another switch. That is below the max length for Cat 7 and well within the specs for 1 Gbps.

I can definitely tell you the lag time due to a LAN cable makes zero difference. It's not lag time but the quality of the cable; it could be faulty, or just too long for the specs (equivalent to a faulty cable).

Nodes do not consider the IP address other than needing it to reach a peer node. The only address that affects connection decisions is the XOR address of the peer/node.

Yes, that’s what I meant. I’ll edit it to be more specific.


Correct, it is individual nodes shunning, not a collective decision. So a node may be getting shunned by 20 of its old routing-table peers, but it is still connected to the network and functioning; it just may not get selected for quotes as often.

Another reason having too many nodes on a device or ISP connection might cause a loss of emissions is that the nodes get shunned more often by other individual nodes.

My wee mini-PC farm is getting similar ANT per node: probably about 1500 nodes, earning about 25-30 as of the same date.

That seems like very low earnings. Must be something not right there?

I wonder if the maximum number of sockets is a consideration? It defaults to around 50k on Linux, from a quick google. 20k nodes could certainly breach that.

It can be tweaked upwards to 1M, which with 20k nodes could be necessary?

Just thinking poorly connected nodes could suffer in general.
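
A few commands to check and raise the limits involved (the systemd unit name below is hypothetical, adjust to however your nodes are run):

  # per-process open-file limit -- each socket consumes a file descriptor
  ulimit -n

  # system-wide ceiling on open files
  sysctl fs.file-max

  # for a systemd-managed node service, raise the per-process limit
  # in the unit file, e.g. /etc/systemd/system/antnode@.service:
  #   [Service]
  #   LimitNOFILE=1048576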


I think sockets (and the max number of them) are TCP-related, not UDP (Edit: not so sure about this, does someone know?).
Maybe the max number of file handles etc. counts?
Edit: or google for this: net.core.rmem_max

https://www.baeldung.com/linux/udp-socket-buffer
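
UDP sockets do still consume file descriptors, and whether the receive buffers are overflowing can at least be checked; a quick sketch:

  # current default and maximum socket receive-buffer sizes, in bytes
  sysctl net.core.rmem_default net.core.rmem_max

  # kernel-wide UDP counters; a rising "receive buffer errors" count means drops
  netstat -su | grep -A 6 '^Udp:'

If "receive buffer errors" keeps climbing while the nodes run, the buffers (or the machine) are too small for the load.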


Ahaa… So shunning is actually record-specific, so to speak? Is it like this:

  • A node fails to deliver a certain chunk and is shunned by the node asking for it. That record is then replicated somewhere else.
  • At (or around) the same time it delivers another record to another node, is not shunned by that one, and that record is not replicated elsewhere.

Yep, shunning is done by a node because it saw wrong or bad behaviour from your node. It's only that one node that shunned you. And actually it takes seeing bad behaviour 3 times before it shuns.

So in effect, out of the 200-odd routing-table peers your node has, there might have been one or more that decided for some reason your node was bad. The reason is a combination of seeing bad behaviour 3 times within a certain period.

It can be your node refusing to talk, not validating a chunk it's supposed to have, not accepting a chunk it should, and I am sure there are other reasons.


In my opinion, they should be shunned only for a limited time; long enough, of course.

I guess this should do no harm:

sudo sysctl -w net.core.rmem_max=4096000

So that buffer size is equal to the Autonomi chunk size, 4 MB? Maybe I should double that?
(The default max (and the default default :slight_smile: ) seems to be 212992 bytes, which sounds a bit too small?)
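
If doubling it, something like this should apply it now and persist it across reboots (8388608 is just my guess at twice the chunk size, and the file name is arbitrary; bumping the send side to match may also make sense):

  # apply immediately
  sudo sysctl -w net.core.rmem_max=8388608
  sudo sysctl -w net.core.wmem_max=8388608

  # persist across reboots
  echo 'net.core.rmem_max=8388608' | sudo tee -a /etc/sysctl.d/99-antnode.conf
  echo 'net.core.wmem_max=8388608' | sudo tee -a /etc/sysctl.d/99-antnode.conf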

The network morphs and nodes join/leave, making it unnecessary to un-shun the nodes that may be healthy.

I did explain somewhere else why it's not such a good idea, and it's up to the node-control software to detect too high a value for "shunned by" and do an orderly restart of the node with a fresh ID (deleting its data dir first). This figure is in the /metrics endpoint and thus can be read and acted on by a monitoring program/script.
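
As a sketch of what such a script could check (the metrics port and the metric name here are assumptions, not confirmed; look at your own node's /metrics output for the real name):

  # assumed: the node exposes Prometheus-style metrics on localhost:13001/metrics
  # with a value named something like ant_node_shunned_count (a guess)
  SHUNS=$(curl -s http://localhost:13001/metrics | awk '/^ant_node_shunned_count/ {print $2}')

  # restart with a fresh ID if shunned by too many peers (threshold is arbitrary)
  if [ "${SHUNS:-0}" -gt 5 ]; then
      echo "node shunned by $SHUNS peers, restarting with a fresh ID"
      # stop the node, delete its data dir, start it again -- depends on your setup
  fi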

No need to restart healthy nodes.

A malicious node should always remain shunned

One major reason not to un-shun a non-malicious node is that it might be on an overloaded PC, and it gets un-shunned only to cause network issues all over again and be shunned again. Or it's behind a faulty router, etc.

Basically, for a node to get shunned by enough other nodes to be a problem for that node, it has to have done serious badness. You do not want that node in the network; it does more harm than good.

To un-shun after a period is to assume nodes are getting shunned for no good reason.


Great point and questions. Worth digging into, I reckon.

But if it is intentional, the bad guy just drops the shunned node and creates a new one.

A long enough shun time minimizes the harm to close to zero.

Yes, that's true. If nodes never get shunned for the wrong reasons, then there is no reason to let them back. But is that so now? If there is a short network break that recovers, will a node be shunned?

But over a long period, weeks, months, maybe years, all nodes end up being shunned by a majority of the network if they can never recover.


Obviously, but still it should remain shunned anyhow.

Dunno what you mean, but if a node is badly behaved because of its environment then you really do not want to be taking the risk. Let's say you set the shun period to one day; then all you do is keep cycling bad behaviour through the network day after day.

Remember, it's still only one node shunning the other, and if the node is really bad then eventually all the closer nodes will shun it. Nodes are NOT shunning because of a few connection dropouts; the node really has to be acting badly.

It is up to the operator to decide to restart a bad node.

But with what will end up being 100 million nodes, the network can afford not to let back nodes that have a high chance of not behaving correctly. It will actually benefit the network.

These nodes are like slim, mean machines that can be disposed of if they crash or act badly. Also, machines with too many nodes shouldn't be allowed to cause quote errors and upload failures through their bad behaviour; it means the rest of the network works harder each time.

It also teaches people not to run too many nodes if their nodes keep getting shunned.

I am sure monitoring scripts/programs will be released (like NTracker and the container ones) that will identify nodes with too high a "shunned by" value and either remove that node forever, reducing the overall count, or delete the node's data dir and then restart it.

It's not up to the network, as in the nodes local to it, to decide to un-shun a node and then work harder because it is not behaving. It's stupidity to do that, and far worse than just restarting it.

Again, you assume that healthy nodes (the node itself and its environment) are being shunned. So you say it's correct that good nodes are not shunned, and then argue as if they are. What do you believe?

All that is needed is for that one node with too high a figure for “shunned by” to be restarted.

Go and look at your nodes and tell me how many are being shunned by more than a few. If you have a healthy environment then it's not worth allowing the many bad nodes back for the 1 or 2 that were really good (if any at all), especially when the monitoring program can just flag a node for a restart with a cleaned-out data dir.

Why save the one good node and let the 100s of bad ones back? Very unhealthy for all the nodes close to those bad ones.

BS. Look at all your good nodes and tell me which are shunned by more than one. It is extremely rare for good nodes to be shunned by more than a few over time.

The network morphs and changes over time, so your case of weeks, months, years just doesn't hold. Nodes are coming and leaving. Nodes will not be up for years anyhow to keep shunning. A restart resets a node's shun list, and even data-centre servers do not claim to be on all the time; they have occasional downtime, and that resets the nodes. So machine upgrades/restarts/maintenance and node software upgrades will restart all nodes across the network at least once a year, clearing all shun lists anyhow.

For every good node shunned there will be hundreds of bad nodes. You are opting for the network to suffer these bad nodes for the sake of a good one that the monitoring will restart with a new ID anyhow. Crazy to me. The nodes are not so critical that we have to save them at all costs; just let the monitoring script restart them with a new ID.

Your scenario was already impossible; with overly shunned nodes being restarted, it becomes doubly so.


With a server having its own IP and no firewall, is there any benefit to specifying --node-port? I’ve been leaving it off in that case since all ports are directly usable.

I do, since it makes sure the nodes are not going to conflict with something I want to run afterwards that tries to bind itself to a port a node took.

Also, I run a firewall and need to know which ports to open.
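
For example, a sketch with ufw (the port range is arbitrary, one port per node):

  # nodes started with fixed ports via --node-port, e.g. 12000 upwards
  # then only that range needs opening:
  sudo ufw allow 12000:12049/udp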
