I believe shunning begins when 3/5 of a close group agree and it is then permanent.
These are experiments so documenting details is premature until the algo is decided upon.
Fair enough, true.
I would love a sentiment check on users who see their nodes shunned.
I feel it is demotivating, especially because the whole thing seems to happen without the node operator being able to take corrective action.
I would like a warning and opportunity to correct my behavior before I get shunned.
It rubs me up in all the wrong ways.
If users feel demotivated by the experience, that is no good for the network either.
I suspect it's just a matter of the algorithm being a bit overzealous. All ten of my nodes ended up shunned before I restarted. They are all on a small Linode instance, so presumably fairly typical. Interesting that the network still seems to be functioning, given that fact.
That's my impression. I am traveling and not participating, yet seeing so many shunned nodes still gets me worked up.
Spoke too soon
0: Failed to upload chunk batch: Multiple consecutive network errors reported during upload
None of my findings are definitive, just sharing what I've found so far - and that doesn't even mean the conclusions are right.
Shunning is on a "personal" level. When our log shows someone has "considered us as BAD" (shunned), that's just the one node. It's not the whole group. I've seen in the logs that if someone in my close group shuns someone, I get a report of it - however, I'm not sure what effect that has on my node's view of that node. Is that one strike against it? Do 3 reports from my close group about a specific node mean I'll start shunning it? Still unsure.
Now, on the other side of that: after my node has blocked another node, that is permanent. After I've marked another node as bad, it's on the dump list. Every time it tries to connect to my node, our node logs that it is blocked - hence the same remote node showing up as blocked over and over in our node logs.
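For anyone trying to picture that "dump list" behaviour, here is a minimal Rust sketch - with made-up type names, not the actual node code - of a per-node, permanent blocklist that rejects every later connection attempt from a blocked peer:

```rust
use std::collections::HashSet;

/// Hypothetical per-node view of peers it has decided to shun.
struct ShunList {
    blocked: HashSet<String>, // peer ids this node has marked as bad
}

impl ShunList {
    fn new() -> Self {
        Self { blocked: HashSet::new() }
    }

    /// Once marked bad, the peer stays on the list permanently.
    fn mark_bad(&mut self, peer: &str) {
        self.blocked.insert(peer.to_string());
    }

    /// Checked on every incoming connection attempt; a blocked peer is
    /// rejected each time, which is where the repeated "blocked" log
    /// lines for the same remote node would come from.
    fn allow_connection(&self, peer: &str) -> bool {
        !self.blocked.contains(peer)
    }
}

fn main() {
    let mut shun = ShunList::new();
    shun.mark_bad("12D3KooW-example-peer");
    assert!(!shun.allow_connection("12D3KooW-example-peer"));
    assert!(shun.allow_connection("12D3KooW-other-peer"));
    println!("blocked peers: {}", shun.blocked.len());
}
```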
Agreed, really there should be a timeout, or in hockey terms a penalty-box assignment with some messaging on how to correct the node's behaviour, then a trial re-intro period… shunning forever is well too Amish for me…
Yeah, when it's a network error, whose network is it - the home network or the WAN provider's? I don't like "taking it on the chin" for some WAN provider who has screwed up his laughable BGP4sec gateway performance scaling, or is working on the cheap…, with a lack of route resiliency from their own backbone provider. Seriously, this is why the SCION protocol and geo-fencing come into play; the boys at ETH in CH have figured that out and spun it out as https://www.anapaya.net/
Perhaps like-minded Autonomi node operators should gang together as a geo-fenced WAN co-op to create an alternate WAN infrastructure that can't be screwed with, for at least part of the Autonomi Network, so the node reputation onus is shifted to individual node operators?
Yeah, it's a big ask…, just brainstorming here.
Something that is confusing me here though is FailedChunkProofCheck
or ReplicationFailure
Those don't strike me as user-related issues?
I could be wrong, but are those not deeper-rooted issues that need to be addressed in the code?
To explain my thought more clearly: if a user runs the provided code and it doesn't do what is expected of it, why is that a shunnable offense?
It can be that the node's environment, bandwidth limitations (possible contention etc.) and suchlike make it of little use (currently) to the network, and it would cost more to keep trying and replicating than to just ignore it.
Ahh ok, an indirect warning that you are probably running too many nodes.
That particular possibility (too many for your bandwidth) should be easy for us to figure out.
I was told it's every node deciding for itself. Close groups are for chunks - each chunk has its close group of nodes - not nodes forming close groups. Nodes now are autonomous and make decisions on their own, not as any sort of group; they share information and work off shared information, but still decide for themselves.
Nodes have neighbours around themselves and also around the potentially bad node in question - it's like 20 or more - and they get information about badness from them. Then each node makes its own decision based on its own info and info asked for from others.
Agreed, and that is what @joshuef has been saying.
Yup, the CLOSE_GROUP is a data-ownership concept, but not used at all in the shunning process.
Nodes close by can and will judge their neighbours. They may inform them of that decision (I think that's the "you have been shunned" type log we're seeing…). That means that one node has shunned it. That is currently permanent for that node. But if no one else shuns, you're still sufficiently in the network.
So effective shunning is an emergent behavior, rather than one that requires consensus or some threshold.
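To illustrate that "every node decides for itself" mechanic, here is a rough Rust sketch - the threshold, weighting, and names are all invented, not the real algorithm - of a node tallying its own observations plus neighbour reports and reaching a purely local verdict:

```rust
use std::collections::HashMap;

/// Illustrative only: each node keeps its own tallies and decides alone.
struct LocalJudgement {
    own_strikes: HashMap<String, u32>,
    neighbour_reports: HashMap<String, u32>,
}

impl LocalJudgement {
    fn new() -> Self {
        Self { own_strikes: HashMap::new(), neighbour_reports: HashMap::new() }
    }

    fn record_own_failure(&mut self, peer: &str) {
        *self.own_strikes.entry(peer.to_string()).or_insert(0) += 1;
    }

    fn record_neighbour_report(&mut self, peer: &str) {
        *self.neighbour_reports.entry(peer.to_string()).or_insert(0) += 1;
    }

    /// Purely local decision: no consensus round, no group threshold.
    /// A peer is only "effectively" shunned network-wide if enough nodes
    /// independently reach the same conclusion.
    fn should_shun(&self, peer: &str) -> bool {
        let own = *self.own_strikes.get(peer).unwrap_or(&0);
        let heard = *self.neighbour_reports.get(peer).unwrap_or(&0);
        // Hypothetical weighting: own evidence counts double.
        own * 2 + heard >= 6
    }
}

fn main() {
    let mut judge = LocalJudgement::new();
    judge.record_own_failure("peer-a");
    judge.record_neighbour_report("peer-a");
    println!("shun peer-a? {}", judge.should_shun("peer-a"));
}
```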
Is there any way to know how many nodes have shunned you?
And a second question if I may: how does the node know it was shunned by another? Is it sent a message? Sounds strange to send a node with issues a message when it seems to be faulty/bad LOL
I think just from the messages received (I count them up; if you're getting a lot, you're probably in a bad place).
Indeed! It was added for debugging and so nodes could potentially act on that info (I'm not sure yet what exactly they'd do… but maybe a node manager could reduce the node count if it sees shunning happening, e.g.)
It's an open question whether shunning should be permanent or not, or should time out after X. I'd be curious to hear folks' thoughts there!
At the moment I'm not sure what we'd do to act on it and fix a node. It is however a useful debugging tool while still figuring out how the nodes interact / what systems can handle, etc.
One thing that I'm noticing is that my nodes, even if I only start 2-3, will usually have one get shunned within 30 seconds to 1 minute of startup. Only one that I remember was for something other than "failedChunkProof". I've also seen in the logs - I believe in quote requests - that there is a read back to the requestor about uptime.
Given that the shunning I've seen (I'm curious if others see the same thing?) happens in the "boot up" time of the node (massive download and organizing of chunks), could / should we avoid blacklisting nodes in the first few minutes of startup, while they orient themselves on what chunks they have / should have and are feverishly reading/writing to disk just to get up to par?
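As a rough illustration of that grace-period idea, a sketch where failures seen during the first few minutes after startup simply aren't counted - the window, threshold, and type names are arbitrary, nothing like this exists in the node today:

```rust
use std::time::{Duration, Instant};

/// Strike counter that ignores failures during a boot-up grace period.
struct StrikeTracker {
    started: Instant,
    grace: Duration,
    strikes: u32,
    threshold: u32,
}

impl StrikeTracker {
    fn new(grace: Duration, threshold: u32) -> Self {
        Self { started: Instant::now(), grace, strikes: 0, threshold }
    }

    fn record_failure(&mut self) {
        // Failures during the boot-up window are not counted.
        if self.started.elapsed() < self.grace {
            return;
        }
        self.strikes += 1;
    }

    fn is_bad(&self) -> bool {
        self.strikes >= self.threshold
    }
}

fn main() {
    // Hypothetical 5-minute grace period and 3-strike threshold.
    let mut tracker = StrikeTracker::new(Duration::from_secs(300), 3);
    tracker.record_failure(); // within the grace period, ignored
    println!("bad? {}", tracker.is_bad());
}
```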
It's a catch-22 on whether the message should go away. For honest node operators it's nice to have an indicator - even if it's a permanent ban - of what is going wrong: drive going bad, network too slow, under-provisioned, whatever.
From the malicious side, it's a tuning tool for performing the bare minimum to not get blacklisted / crafting a node that is bad but looks OK. I'd rather see these people wonder why they're simply losing peers.
I'm torn long term - but think it should stay for the beta phase regardless.
I could see a one time second chance after a timeout (for reasons like the above boot-up/acclimation time). But trying the same known-bad node over and over I think is wasted overhead.
I would say that before a node is determined to be bad (which is permanent currently), timeouts should definitely be in place. There are always transient events, and they often occur close together.
But in my opinion, if the node is determined bad by, say, more than 5 other nodes, then the node manager should restart the node with a new xor address OR reduce the number of nodes being run on the machine by removing that now-bad node. This would require messages sent to the "bad" node telling it it is bad, so the node manager (or the node itself) can do a restart.
So no, once bad it's not worth giving it another chance; for good actors, let it restart at a new xor address.
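A quick sketch of what that node-manager behaviour could look like - the types and the restart hook are hypothetical, and the threshold of 5 reports is taken from the post above, not from any real implementation:

```rust
/// Hypothetical node-manager view of one managed node.
struct ManagedNode {
    xor_address: u64, // stand-in for a real XOR name / keypair
    shun_reports: u32,
}

impl ManagedNode {
    fn record_shun_report(&mut self) {
        self.shun_reports += 1;
    }

    /// Restart once more than 5 distinct peers have reported the node as bad.
    fn needs_restart(&self) -> bool {
        self.shun_reports > 5
    }

    fn restart_with_new_identity(&mut self) {
        // A real implementation would generate a fresh keypair; here we
        // just fake a new address and clear the report counter.
        self.xor_address = self.xor_address.wrapping_add(1);
        self.shun_reports = 0;
    }
}

fn main() {
    let mut node = ManagedNode { xor_address: 42, shun_reports: 6 };
    if node.needs_restart() {
        node.restart_with_new_identity();
    }
    println!("address: {}, reports: {}", node.xor_address, node.shun_reports);
}
```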
This is probably indicative of failed GETs or so. There's a couple of angles here…
Yeah, at the moment we allow for some of this, but the events aren't expiring unless we have more events, so adding that in may smooth things over here…
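If the events did expire, it might look something like this sliding-window sketch - the window size and threshold are made up, and this is not the node's actual event tracking:

```rust
use std::time::{Duration, Instant};

/// Only failures seen within a sliding window count towards marking a
/// peer bad, so a transient burst eventually "smooths over".
struct ExpiringStrikes {
    window: Duration,
    threshold: usize,
    events: Vec<Instant>,
}

impl ExpiringStrikes {
    fn new(window: Duration, threshold: usize) -> Self {
        Self { window, threshold, events: Vec::new() }
    }

    fn record(&mut self) {
        let now = Instant::now();
        self.events.push(now);
        // Drop events older than the window.
        self.events.retain(|t| now.duration_since(*t) <= self.window);
    }

    fn over_threshold(&self) -> bool {
        self.events.len() >= self.threshold
    }
}

fn main() {
    // Hypothetical 10-minute window and 5-event threshold.
    let mut strikes = ExpiringStrikes::new(Duration::from_secs(600), 5);
    strikes.record();
    println!("shun? {}", strikes.over_threshold());
}
```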
Brainstorming here: perhaps taking a cue from today's human model, nodes that get shunned receive a "jail" term and must remediate. This is where PoR (Proof of Resource) testing should make a comeback - a node that is "jailed" must run a diagnostics suite to remediate.
The above concept might work something like this:
Shunned and now jailed node - get-out-of-jail remediation, rejoin.
IF pass:
a. the PoR test routine run by the shunned node sends out a gossip "pass" message to the other nodes in the close group
b. the close group nodes shunning the "jailed" node update its state from "jailed" to "trusted", where a minus 1 is applied to reduce the reputation class of the jailed node?
Jail time consensus:
The gossip message from the shunning nodes is time-stamped and compared among close group members in a list; the time stamp value is then averaged and stored by all close group members for the previously jailed node?
The "jail term" for a first-time offense is the time difference between the start of the "jail sentence" for the offending node and the time the jailed node spends running the PoR routine, plus some initial delay; it is sent out by the offending node via the PoR routine, which cannot be manipulated by the node owner.
IF the PoR routine is not passed, a "not passed PoR" gossip message is posted to the log of the jailed node,
AND, before it can rejoin its close group, it must run a more extensive level-2 set of PoR diagnostics and wait longer to join back, with more detailed PoR diagnostics written to the jailed node's logs to help the node owner fix the problem, likely with some advice on what the fix might be.
Some thoughts on this "how to fix the problem":
Reduce node count,
Run fewer competing processes on system
Run existing processes competing with node boot at lower priority
Increase Network Bandwidth
etc…
This area requires a whole set of test cases with pass/fail metrics, which get embodied in the PoR diagnostics.
The next time the same previously jailed and now remediated node is found by the close group to be a repeat offender:
The PoR diagnostics add more test cases (edge cases), which take more time, and the length of the "sentence" or jail time - the wait-to-join period - is extended beyond the first offense, with more remedies suggested by the extended PoR diagnostics in the log. The node owner uses these to address the edge-case conditions that might be contributing to the node's behaviour that caused it to be shunned and jailed in the first place by a consensus of the close group member nodes.
Since nodes can act on their own, they can choose, based on their owner's "tuned" or configured criteria, to continue to shun what they see as an offending node OR to go with the consensus and move the node to a "grey" list, giving it a lower reputation commensurate with the offence. It might be that, say, trying a double spend bans you forever, whereas being slow to respond or being offline for extended periods is given more leniency?
Anyway, drawing parallels to the human model of "jail time" and remediation may give us all some food for thought on how we might address this set of challenges. PoR diagnostics might hold part of the solution: they keep node owners on their toes to configure appropriately, but also educate them in the process through a trusted, agnostic PoR diagnostics feedback loop, so they can remediate the shunned/offending and fairly jailed node (penalty commensurate with the offence) quickly and proactively, based on informative log info.
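To make the jail/PoR idea a bit more concrete, here is a very rough Rust sketch of the per-peer state machine it implies. Everything here - the state names, the doubling sentence, the reputation hit, the PoR pass/fail hook - is invented for illustration; none of it exists in the current node code:

```rust
/// Hypothetical per-peer state as seen by a shunning node.
#[derive(Debug)]
enum PeerState {
    Trusted { reputation: i32 },
    Jailed { offences: u32, sentence_secs: u64 },
}

fn on_shunned(offences: u32) -> PeerState {
    // Each repeat offence doubles the jail term (first term: 10 minutes).
    PeerState::Jailed {
        offences,
        sentence_secs: 600 * 2u64.pow(offences.saturating_sub(1)),
    }
}

fn on_por_result(state: PeerState, passed: bool) -> PeerState {
    match state {
        PeerState::Jailed { offences, .. } if passed => {
            // Rejoin, but with reputation knocked down by one per offence.
            PeerState::Trusted { reputation: -(offences as i32) }
        }
        PeerState::Jailed { offences, sentence_secs } => {
            // Failed diagnostics: longer sentence, more detailed tests next time.
            PeerState::Jailed { offences, sentence_secs: sentence_secs * 2 }
        }
        trusted => trusted,
    }
}

fn main() {
    let jailed = on_shunned(2);
    println!("{:?}", jailed);
    println!("{:?}", on_por_result(jailed, true));
}
```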
You could implement a timeout for the shunning period, and increment the timeout if it happens to a node again.
You could call the feature Shuncrement
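A minimal "Shuncrement" sketch along those lines, assuming a doubling timeout per repeat offence (the base timeout and growth factor are arbitrary choices, not anything in the node code):

```rust
use std::time::Duration;

/// Shun timeout that doubles with each repeat offence instead of
/// shunning permanently.
fn shun_timeout(offence_count: u32) -> Duration {
    let base = Duration::from_secs(60);
    base * 2u32.saturating_pow(offence_count.saturating_sub(1))
}

fn main() {
    for offence in 1..=5 {
        println!("offence {offence}: shunned for {:?}", shun_timeout(offence));
    }
}
```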