Shunned

I believe shunning begins when 3/5 of a close group agree and it is then permanent.

These are experiments so documenting details is premature until the algo is decided upon.

2 Likes

Fair enough, true.

I would love a sentiment check on users who see their nodes shunned.
I feel it is demotivating, especially because the whole thing seems to happen without the node operator being able to take corrective action.

I would like a warning and opportunity to correct my behavior before I get shunned.

It rubs me up in all the wrong ways :confused:.

If users feel demotivated by the experience, that is no good for the network either.

3 Likes

I suspect it's just a matter of the algorithm being a bit overzealous. All ten of my nodes ended up shunned before I restarted. They are all on a small Linode instance, so presumably fairly typical. Interesting that the network still seems to be functioning, given that fact.

2 Likes

:100: my impression. I am traveling and not participating, yet still seeing so many shunned nodes gets me worked up :rofl:

2 Likes

Spoke too soon
0: Failed to upload chunk batch: Multiple consecutive network errors reported during upload

1 Like

None of my findings are definitive, just sharing what I've found so far - and that doesn't even mean the conclusions are right. :stuck_out_tongue:


Most of these shunned nodes - currently on the community net - were shunned within 30 seconds of startup. Since then they've all still stayed well connected and been getting chunks and payments. If you're shunned you can't connect to the node that shunned you. So if you're connected to other nodes, you're not completely shunned.

Shunning is on a "personal" level. When our log shows someone has "considered us as BAD" (shunned), that's just the one node. It's not the whole group. I've seen in the logs that if someone in my close group shuns someone, I get a report of it - however, I'm not sure what effect that has on my node's view of that node. Is that one strike against it? Does 3 reports from the close group about a specific node mean I'll start shunning it? Still unsure.

Now, on the other side of that: after my node has blocked another node, that is permanent. After I've marked another node as bad, it's on the dump list. Every time it tries to connect to my node, our node logs that it is blocked - hence the same remote node showing up as blocked over and over in our node logs.
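
Roughly how I picture that bookkeeping, in Rust-ish form - this is just my mental model, not the actual node code, and all the names in it are made up:

```rust
use std::collections::HashSet;

/// Hypothetical peer id type, standing in for the real libp2p PeerId.
type PeerId = u64;

/// My mental model of one node's private view of its peers.
/// Nothing here is shared state - every node keeps its own copy.
struct PeerView {
    /// Peers this node has personally marked as bad. Permanent:
    /// nothing ever removes entries from this set.
    shunned: HashSet<PeerId>,
}

impl PeerView {
    fn new() -> Self {
        Self { shunned: HashSet::new() }
    }

    /// Called when *this* node decides a peer is bad (e.g. a failed
    /// chunk proof). Only affects this node's view.
    fn mark_bad(&mut self, peer: PeerId) {
        self.shunned.insert(peer);
    }

    /// Called for every inbound connection attempt. A previously
    /// shunned peer is rejected (and logged) every time it retries,
    /// which is why the same remote id keeps showing up in the logs.
    fn allow_connection(&self, peer: PeerId) -> bool {
        if self.shunned.contains(&peer) {
            println!("rejecting blocked peer {peer}");
            false
        } else {
            true
        }
    }
}

fn main() {
    let mut view = PeerView::new();
    view.mark_bad(42);
    assert!(!view.allow_connection(42)); // permanently blocked here...
    assert!(view.allow_connection(7));   // ...but only by this one node
}
```

Which would also explain why the same blocked peer keeps reappearing in my logs: the set never shrinks.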

6 Likes

Agreed, really there should be a timeout, or in hockey terms a penalty box assignment with some messaging on how to correct the node's behaviour, then a trial re-intro period… shunning for ever is well too Amish for me…

3 Likes

Yeah, when it's a network error, whose network is it - the home network or the WAN provider? I don't like 'taking it on the chin' for some WAN provider who has screwed up his laughable BGP4sec gateway performance scaling, or is working on the cheap, with a lack of route resiliency from their own backbone provider. Seriously, this is why the SCION protocol and geo-fencing come into play; the boys at ETH in CH have figured that out and spun it out as https://www.anapaya.net/

Perhaps like-minded Autonomi node operators should gang together as a geo-fenced WAN co-op to create an alternate WAN infrastructure that can't be screwed with, for at least part of the Autonomi Network, so the node reputation onus is shifted to individual node operators?

Yeah, it's a big ask… just brainstorming here. :wink:

3 Likes

Something that is confusing me here, though, is that FailedChunkProofCheck or ReplicationFailure don't strike me as user-related issues?

I could be wrong, but are those not deeper-rooted issues that need to be addressed in the code?

To explain my thought more clearly: if a user runs the provided code and it doesn't do what is expected of it, why is that a shunnable offense?

3 Likes

It can be the node's environment: bandwidth limitations (possible contention etc.) and suchlike make it of little use (currently) to the network, and it would cost more to keep trying and replicating than to just ignore it.

5 Likes

Ahh ok, an indirect warning that you are probably running too many nodes.

That particular possibility (too many for your bandwidth) should be easy for us to figure out.
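
As a back-of-the-envelope check - the per-node figure below is a made-up placeholder, not a measured requirement:

```rust
/// Rough sanity check: is my uplink plausibly enough for N nodes?
/// The per-node bandwidth guess is invented, not an official number.
fn main() {
    let uplink_mbps = 40.0;          // what my connection can actually sustain upstream
    let per_node_guess_mbps = 2.0;   // assumed average per node (placeholder)
    let node_count = 25.0;

    let needed = per_node_guess_mbps * node_count;
    if needed > uplink_mbps {
        println!(
            "{node_count} nodes want ~{needed} Mbps but only {uplink_mbps} Mbps is available - expect shunning"
        );
    } else {
        println!("probably fine: ~{needed} of {uplink_mbps} Mbps");
    }
}
```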

4 Likes

I was told it's every node deciding for itself. Close groups are for chunks: each chunk has its close group of nodes, not nodes forming close groups. Nodes are now autonomous and make decisions on their own, not as any sort of group; they share information and work off shared information but still decide for themselves.

A node has neighbours around itself and also around the potentially bad node in question, 20 or more of them, and it gets information about badness from them. Then it makes its own decision based on its own info and info asked for from others.

Agreed, and it is what @joshuef has been saying.

2 Likes

Yup, the CLOSE_GROUP is a data-ownership concept, but not used at all in the shunning process.

Nodes close by can and will judge neighbours. They may inform them of that decision (I think that's the you have been shunned type log we're seeing…). That means that one node has shunned it. That is currently permanent for that node. But if no one else shuns you, you're still sufficiently in the network.

So effective shunning is an emergent behavior, rather than one that requires consensus or some threshold.
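
A toy way to picture the emergent part (numbers invented purely for illustration):

```rust
/// Toy illustration of "emergent" shunning: each neighbour decides on its
/// own, and you only drop off the network once enough of them have,
/// individually, shunned you. No vote, no agreed threshold anywhere.
fn main() {
    let neighbours = 20;   // rough size of a node's neighbourhood
    let shunned_by = 3;    // neighbours that independently shunned us
    let reachable = neighbours - shunned_by;

    // Still well connected: only the 3 shunning nodes refuse our connections.
    println!("reachable neighbours: {reachable}/{neighbours}");

    // Effective isolation only emerges if nearly everyone independently
    // reaches the same conclusion about us.
    let effectively_shunned = reachable == 0;
    println!("effectively shunned: {effectively_shunned}");
}
```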

5 Likes

Is there any way to know how many nodes have shunned you?

And a second question if I may: how does the node know it was shunned by another? Is it sent a message? It sounds strange to send a node a message when it seems to be faulty/bad LOL

3 Likes

I think just the messages received (I count them up; if you're getting a lot, you're probably in a bad place).

Indeed! It was added for debugging and so nodes could potentially act on that info (I'm not sure yet what exactly they'd do… but maybe a nodemanager could reduce the node count if they see shunning happening, e.g.)
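
For instance, something in this direction - only a sketch of the idea, nothing like it exists yet, and the threshold is plucked out of the air:

```rust
/// Sketch only - hypothetical node-manager behaviour, invented threshold.
struct ManagedNode {
    name: String,
    shun_notifications: u32, // "you have been considered as BAD" messages seen
}

/// If one of our nodes is collecting a lot of shun notifications, assume the
/// machine is overloaded and stop that node rather than let it keep degrading.
fn prune_struggling_nodes(nodes: &mut Vec<ManagedNode>, limit: u32) {
    nodes.retain(|n| {
        if n.shun_notifications >= limit {
            println!("stopping {} ({} shun notifications)", n.name, n.shun_notifications);
            false
        } else {
            true
        }
    });
}

fn main() {
    let mut nodes = vec![
        ManagedNode { name: "node-1".into(), shun_notifications: 0 },
        ManagedNode { name: "node-2".into(), shun_notifications: 12 },
    ];
    prune_struggling_nodes(&mut nodes, 10);
    assert_eq!(nodes.len(), 1);
}
```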


It's an open question whether shunning should be permanent or not, or should time out after X. I'd be curious to hear folks' thoughts there!

5 Likes

At the moment I'm not sure what we'd do to act on it and fix a node. It is, however, a useful debugging tool while we're still figuring out how the nodes interact / what systems can handle, etc.

One thing that I'm noticing is that my nodes, even if I only start 2-3, will usually have one get shunned within 30 seconds to 1 minute of startup. Only one that I remember was something other than "failedChunkProof". I've also seen in the logs - I believe in quote requests - that there is a read back to the requestor about uptime.

Given that the shunning I've seen (I'm curious if others see the same thing?) happens in the "boot up" time of the node (massive download and organizing of chunks), could / should we not blacklist nodes in the first few minutes of startup, while they orient themselves on what chunks they have / should have and are feverishly reading/writing to disk just to get up to par?
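
Purely as a sketch of what I mean - the grace period length and the strike limit are guesses, and none of these names are real:

```rust
use std::time::{Duration, Instant};

/// Don't turn failures into a shun while a peer is still in its "boot up"
/// window. The 5 minutes and the strike limit are guesses, not real values.
const GRACE: Duration = Duration::from_secs(5 * 60);

struct ObservedPeer {
    first_seen: Instant,
    failures: u32,
}

impl ObservedPeer {
    fn record_failure(&mut self) {
        self.failures += 1;
    }

    /// Only consider shunning once the peer is past its boot-up window.
    fn eligible_for_shunning(&self, now: Instant) -> bool {
        now.duration_since(self.first_seen) > GRACE && self.failures >= 3
    }
}

fn main() {
    let start = Instant::now();
    let mut peer = ObservedPeer { first_seen: start, failures: 0 };
    peer.record_failure();
    peer.record_failure();
    peer.record_failure();
    // Immediately after startup: failures are tolerated.
    println!("shun now? {}", peer.eligible_for_shunning(start + Duration::from_secs(30)));
    // The same failures well after the grace period would count.
    println!("shun later? {}", peer.eligible_for_shunning(start + GRACE * 2));
}
```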

It's a catch-22 whether the message should go away. For honest node operators it's nice to have an indicator - even if it's a permanent ban - of what is going wrong: drive going bad, network too slow, under-provisioned, whatever.
From the malicious side, it's a tuning tool for performing the bare minimum to not get blacklisted / crafting a node that is bad but looks OK. I'd like to see these people wonder why they're simply losing peers.
I'm torn long term - but I think it should stay for the beta phase regardless.

I could see a one-time second chance after a timeout (for reasons like the above boot-up/acclimation time). But retrying the same known-bad node over and over is, I think, wasted overhead.

4 Likes

I would say that before a node is determined to be bad (permanent currently), timeouts should definitely be in place. There are always transient events, and they often occur close together.

But in my opinion, if the node is determined bad by, say, more than 5 other nodes, then the node manager should be restarting the node with a new xor address OR reducing the number of nodes being run on the machine by removing that now-bad node. This would require the messages sent to the "bad" node telling it it is bad, so the node manager (or the node itself) can do a restart.

So no, once bad it's not worth giving it another chance; for good actors, let it restart at a new xor address.
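
In rough sketch form, what I mean is something like this - the threshold of 5 and all the names are just for illustration, nothing like this exists:

```rust
/// Illustration only - invented names and threshold.
struct LocalNode {
    xor_address: u128,
    bad_reports: u32, // "you are bad" messages received from other nodes
}

impl LocalNode {
    /// If enough other nodes have judged us bad, don't limp along: come back
    /// with a fresh identity (or the manager could instead just reduce the
    /// node count on this machine).
    fn maybe_restart(&mut self, threshold: u32, fresh_address: u128) {
        if self.bad_reports >= threshold {
            println!("restarting node at new xor address {fresh_address:x}");
            self.xor_address = fresh_address;
            self.bad_reports = 0;
        }
    }
}

fn main() {
    let mut node = LocalNode { xor_address: 0xA11CE, bad_reports: 6 };
    node.maybe_restart(5, 0xB0B);
    assert_eq!(node.xor_address, 0xB0B);
}
```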

2 Likes

This is probably indicative of failed GETs or so. There's a couple of angles here…

  • we could be more tolerant
  • nodes could prefetch data so they aren't failing

Yeh, at the moment we allow for some of this, but they're not expiring unless we have more events, so adding that in may smooth things over here… :thinking:
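
e.g. letting recorded issues age out instead of accumulating forever - very roughly, with an invented window length:

```rust
use std::time::{Duration, Instant};

/// Rough sketch of expiring events: only issues seen inside a sliding
/// window count towards shunning. The window length is invented.
const WINDOW: Duration = Duration::from_secs(10 * 60);

struct IssueTracker {
    events: Vec<Instant>, // timestamps of observed issues for one peer
}

impl IssueTracker {
    fn record(&mut self, at: Instant) {
        self.events.push(at);
    }

    /// Drop anything older than the window, then count what's left.
    fn recent_issues(&mut self, now: Instant) -> usize {
        self.events.retain(|t| now.duration_since(*t) <= WINDOW);
        self.events.len()
    }
}

fn main() {
    let start = Instant::now();
    let mut tracker = IssueTracker { events: Vec::new() };
    tracker.record(start);
    tracker.record(start + Duration::from_secs(30));
    // Shortly after: both issues still count.
    println!("now: {}", tracker.recent_issues(start + Duration::from_secs(60)));
    // Much later: they have expired, so a single new hiccup wouldn't tip anything.
    println!("later: {}", tracker.recent_issues(start + WINDOW * 2));
}
```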

3 Likes

Brainstorming here: perhaps taking a cue from today's human model, shunned nodes get a 'jail' term and must remediate. This is where PoR (Proof of Resource) testing should make a comeback, where a 'jailed' node must run a diagnostics suite to remediate.

The above concept might work something like this:

Shunned and now Jailed Node - Get out of Jail remediation, rejoin

  1. pass a level one set of PoR tests which take minutes of time,

IF pass,

a. the PoR Test routine run by the shunned node sends out a Gossip Pass message to the other nodes in the close group

b. the close group nodes shunning the 'jailed' node update its state from 'jailed' to 'trusted', where a 'minus' -1 is applied to reduce the reputation class of the jailed node?

Jail Time Consensus
The gossip message from the shunning nodes is timestamped and compared among close group members in a list; the timestamp value is then averaged and stored by all close group members for the previously jailed node?

The time difference between the start of the 'jail sentence' for the offending node and the amount of time the jailed node spends running the PoR routine, plus some initial delay - the 'jail term', in this case for a first-time offense - is sent out by the offending node via the PoR routine, which cannot be manipulated by the node owner.

IF the PoR routine is not passed, a gossip 'not passed PoR' message is posted to the log of the jailed node,

AND, before it can rejoin its close group, it must run a more extensive level 2 set of PoR diagnostics and wait an additional join-back time, with more detailed PoR diagnostics written to the jailed node's logs to help the node owner fix the problem, likely with some advice on what the fix might be.

Some thoughts on this 'how to fix the problem':

  • Reduce node count
  • Run fewer competing processes on the system
  • Run existing processes competing with node boot at lower priority
  • Increase network bandwidth
  • etc…

This area requires a whole set of test cases with pass/fail metrics which get embodied in the PoR diagnostics.

The next time the same previously jailed and now remediated node is found by the close group to be a repeat offender:

The length of 'sentence' or jail time grows: the PoR diagnostics add more test cases (edge cases), which take more time, and the wait-to-join period is extended beyond the first time, with more remedies suggested in the log by the extended PoR diagnostics. The node owner uses these to address the edge-case conditions that might be contributing to the behaviour that caused the node to be shunned and jailed in the first place by a consensus of the close group member nodes.

Since nodes can act on their own, they can choose, based on their owner's 'tuned' or configured criteria, to continue to shun what they see as an offending node, OR go with the consensus and adjust their view of the node in a 'grey' list, giving the node a lower reputation commensurate with the offense. It might be that, say, trying a double spend bans you forever, whereas being slow to respond or being offline for extended periods is given more leniency?

Anyway, drawing parallels to the human model of 'jail time' and remediation may give us all some food for thought on how we might address this set of challenges. PoR diagnostics might hold part of the solution: they keep node owners on their toes to configure appropriately, but also educate them in the process with a trusted, agnostic PoR diagnostics feedback loop, so they can remediate the shunned/offending and fairly jailed node (penalty commensurate with the offense) quickly and proactively, based on informative log info. :wink:
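
To make the brainstorm a little more concrete, here is a toy sketch of the jail/remediation flow - every state, duration and 'diagnostic' in it is invented:

```rust
use std::time::Duration;

/// Toy sketch of the jail/remediation idea - nothing here is real.
#[derive(Debug, PartialEq)]
enum Standing {
    Trusted,
    Jailed { offenses: u32, term: Duration },
}

/// Jail term grows with repeat offenses (first offense short, then longer).
fn jail(offenses: u32) -> Standing {
    Standing::Jailed {
        offenses,
        term: Duration::from_secs(60) * 2u32.pow(offenses.min(6)),
    }
}

/// Stand-in for a level 1 PoR diagnostics run: in reality this would be a
/// suite of resource tests with pass/fail metrics.
fn run_por_diagnostics() -> bool {
    true // pretend the node passed
}

/// After serving the term *and* passing diagnostics, the node is let back
/// in; a repeat offense next time would earn a longer sentence.
fn try_release(standing: Standing) -> Standing {
    match standing {
        Standing::Jailed { offenses, term } if run_por_diagnostics() => {
            println!("released after {:?} (offense #{offenses})", term);
            Standing::Trusted
        }
        other => other,
    }
}

fn main() {
    let first = jail(1);   // first offense: short term
    let second = jail(2);  // repeat offense: longer term
    println!("{first:?}\n{second:?}");
    assert_eq!(try_release(first), Standing::Trusted);
}
```

The escalating term is the 'repeat offender' part; the real work would be in the PoR test cases themselves.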

2 Likes

You could implement a timeout for the shunning period, and increment the timeout if it happens to a node again.

You could call the feature Shuncrement :copyright: :crazy_face:
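
A tiny sketch of that Shuncrement idea - the base timeout and the doubling are arbitrary:

```rust
use std::time::Duration;

/// "Shuncrement": each repeat offense doubles the shun timeout.
/// Base value and doubling factor are arbitrary choices for illustration.
fn shun_timeout(offense_count: u32) -> Duration {
    Duration::from_secs(5 * 60) * 2u32.pow(offense_count.saturating_sub(1).min(10))
}

fn main() {
    for offense in 1..=4 {
        println!("offense {offense}: shunned for {:?}", shun_timeout(offense));
    }
}
```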

2 Likes