I believe shunning begins when 3/5 of a close group agree and it is then permanent.
These are experiments so documenting details is premature until the algo is decided upon.
Fair enough, true.
I would love a sentiment check on users who see their nodes shunned.
I feel it is demotivating, especially because the whole thing seems to happen without the node operator being able to take corrective action.
I would like a warning and opportunity to correct my behavior before I get shunned.
It rubs me up in all the wrong ways.
If users feel demotivated by the experience, that is no good for the network either.
I suspect it's just a matter of the algorithm being a bit overzealous. All ten of my nodes ended up shunned before I restarted. They are all on a small Linode instance, so presumably fairly typical. Interesting that the network still seems to be functioning, given that fact.
That's my impression. I am traveling and not participating, yet seeing so many shunned nodes still gets me worked up.
Spoke too soon
0: Failed to upload chunk batch: Multiple consecutive network errors reported during upload
None of my findings are definitive, just sharing what I've found so far - and that doesn't even mean the conclusions are right.
Shunning is on a "personal" level. When our log shows someone has "considered us as BAD" (shunned), that's just the one node. It's not the whole group. I've seen in the logs that if someone in my close group shuns someone, I get a report of it - however, I'm not sure what effect that has on my node's view of that node. Is that one strike against it? Do 3 reports from my close group about a specific node mean I'll start shunning it? Still unsure.
Now, on the other side of that: after my node has blocked another node, that is permanent. After I've marked another node as bad, it's on the dump list. Every time it tries to connect to my node, our node logs that it is blocked - hence the same remote node showing up as blocked over and over in our node logs.
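For anyone trying to picture that "dump list" behaviour, here is a minimal Rust sketch - with made-up type names, not the actual node code - of a per-node, permanent blocklist that rejects every later connection attempt from a blocked peer:

```rust
use std::collections::HashSet;

/// Hypothetical per-node view of peers it has decided to shun.
struct ShunList {
    blocked: HashSet<String>, // peer ids this node has marked as bad
}

impl ShunList {
    fn new() -> Self {
        Self { blocked: HashSet::new() }
    }

    /// Once marked bad, the peer stays on the list permanently.
    fn mark_bad(&mut self, peer: &str) {
        self.blocked.insert(peer.to_string());
    }

    /// Checked on every incoming connection attempt; a blocked peer is
    /// rejected each time, which is where the repeated "blocked" log
    /// lines for the same remote node would come from.
    fn allow_connection(&self, peer: &str) -> bool {
        !self.blocked.contains(peer)
    }
}

fn main() {
    let mut shun = ShunList::new();
    shun.mark_bad("12D3KooW-example-peer");
    assert!(!shun.allow_connection("12D3KooW-example-peer"));
    assert!(shun.allow_connection("12D3KooW-other-peer"));
    println!("blocked peers: {}", shun.blocked.len());
}
```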
Agreed, really there should be a timeout, or in hockey terms a penalty-box assignment with some messaging on how to correct the node's behaviour, then a trial re-intro period… shunning forever is well too Amish for me…
Yeah, when it's a network error, whose network is it - the home network or the WAN provider's? I don't like "taking it on the chin" for some WAN provider who has screwed up his laughable BGP4sec gateway performance scaling, or is working on the cheap…, with a lack of route resiliency from their own backbone provider. Seriously, this is why the SCION protocol and geo-fencing come into play; the boys at ETH in CH have figured that out and spun it out as https://www.anapaya.net/
Perhaps like-minded Autonomi node operators should gang together as a geo-fenced WAN co-op to create an alternate WAN infrastructure that can't be screwed with, for at least part of the Autonomi Network, so the node reputation onus is shifted to individual node operators?
Yeah, it's a big ask…, just brainstorming here.
Something that is confusing me here though is FailedChunkProofCheck
or ReplicationFailure
Those don't strike me as user-related issues?
I could be wrong, but are those not deeper-rooted issues that need to be addressed in the code?
To explain my thought more clearly: if a user runs the provided code and it doesn't do what is expected of it, why is that a shunnable offense?
It can be that the node's environment, bandwidth limitations (possible contention etc.) and suchlike make it of little use (currently) to the network, and it would cost more to keep trying and replicating than to just ignore it.
Ahh ok, an indirect warning that you are probably running too many nodes.
That particular possibility (too many for your bandwidth) should be easy for us to figure out.
I was told it's every node deciding for itself. Close groups are for chunks - each chunk has its close group of nodes - not nodes forming close groups. Nodes now are autonomous and make decisions on their own, not as any sort of group; they share information and work off shared information, but still decide for themselves.
Nodes have neighbours around themselves and also around the potentially bad node in question - it's like 20 or more - and they get information about badness from them. Then each node makes its own decision based on its own info and info asked for from others.
Agreed, and that is what @joshuef has been saying.
Yup, the CLOSE_GROUP is a data-ownership concept, but not used at all in the shunning process.
Nodes close by can and will judge their neighbours. They may inform them of that decision (I think that's the "you have been shunned" type log we're seeing…). That means that one node has shunned it. That is currently permanent for that node. But if no one else shuns, you're still sufficiently in the network.
So effective shunning is an emergent behavior, rather than one that requires consensus or some threshold.
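To illustrate that "every node decides for itself" mechanic, here is a rough Rust sketch - the threshold, weighting, and names are all invented, not the real algorithm - of a node tallying its own observations plus neighbour reports and reaching a purely local verdict:

```rust
use std::collections::HashMap;

/// Illustrative only: each node keeps its own tallies and decides alone.
struct LocalJudgement {
    own_strikes: HashMap<String, u32>,
    neighbour_reports: HashMap<String, u32>,
}

impl LocalJudgement {
    fn new() -> Self {
        Self { own_strikes: HashMap::new(), neighbour_reports: HashMap::new() }
    }

    fn record_own_failure(&mut self, peer: &str) {
        *self.own_strikes.entry(peer.to_string()).or_insert(0) += 1;
    }

    fn record_neighbour_report(&mut self, peer: &str) {
        *self.neighbour_reports.entry(peer.to_string()).or_insert(0) += 1;
    }

    /// Purely local decision: no consensus round, no group threshold.
    /// A peer is only "effectively" shunned network-wide if enough nodes
    /// independently reach the same conclusion.
    fn should_shun(&self, peer: &str) -> bool {
        let own = *self.own_strikes.get(peer).unwrap_or(&0);
        let heard = *self.neighbour_reports.get(peer).unwrap_or(&0);
        // Hypothetical weighting: own evidence counts double.
        own * 2 + heard >= 6
    }
}

fn main() {
    let mut judge = LocalJudgement::new();
    judge.record_own_failure("peer-a");
    judge.record_neighbour_report("peer-a");
    println!("shun peer-a? {}", judge.should_shun("peer-a"));
}
```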
Is there any way to know how many nodes have shunned you?
And a second question if I may: how does the node know it was shunned by another? Is it sent a message? Sounds strange to send a node with issues a message when it seems to be faulty/bad LOL
I think just from the messages received (I count them up; if you're getting a lot, you're probably in a bad place).
Indeed! It was added for debugging and so nodes could potentially act on that info (I'm not sure yet what exactly they'd do… but maybe a node manager could reduce the node count if it sees shunning happening, e.g.)
It's an open question whether shunning should be permanent or not, or should time out after X. I'd be curious to hear folks' thoughts there!
At the moment I'm not sure what we'd do to act on it and fix a node. It is however a useful debugging tool while still figuring out how the nodes interact / what systems can handle, etc.
One thing that I'm noticing is that my nodes, even if I only start 2-3, will usually have one get shunned within 30 seconds to 1 minute of startup. Only one that I remember was for something other than "failedChunkProof". I've also seen in the logs - I believe in quote requests - that there is a read back to the requestor about uptime.
Given that the shunning I've seen (I'm curious if others see the same thing?) happens in the "boot up" time of the node (massive download and organizing of chunks), could / should we avoid blacklisting nodes in the first few minutes of startup, while they orient themselves on what chunks they have / should have and are feverishly reading/writing to disk just to get up to par?
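As a rough illustration of that grace-period idea, a sketch where failures seen during the first few minutes after startup simply aren't counted - the window, threshold, and type names are arbitrary, nothing like this exists in the node today:

```rust
use std::time::{Duration, Instant};

/// Strike counter that ignores failures during a boot-up grace period.
struct StrikeTracker {
    started: Instant,
    grace: Duration,
    strikes: u32,
    threshold: u32,
}

impl StrikeTracker {
    fn new(grace: Duration, threshold: u32) -> Self {
        Self { started: Instant::now(), grace, strikes: 0, threshold }
    }

    fn record_failure(&mut self) {
        // Failures during the boot-up window are not counted.
        if self.started.elapsed() < self.grace {
            return;
        }
        self.strikes += 1;
    }

    fn is_bad(&self) -> bool {
        self.strikes >= self.threshold
    }
}

fn main() {
    // Hypothetical 5-minute grace period and 3-strike threshold.
    let mut tracker = StrikeTracker::new(Duration::from_secs(300), 3);
    tracker.record_failure(); // within the grace period, ignored
    println!("bad? {}", tracker.is_bad());
}
```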
It's a catch-22 on whether the message should go away. For honest node operators it's nice to have an indicator - even if it's a permanent ban - of what is going wrong: drive going bad, network too slow, under-provisioned, whatever.
From the malicious side, it's a tuning tool for performing the bare minimum to not get blacklisted / crafting a node that is bad but looks OK. I'd rather see these people wonder why they're simply losing peers.
I'm torn long term - but think it should stay for the beta phase regardless.
I could see a one time second chance after a timeout (for reasons like the above boot-up/acclimation time). But trying the same known-bad node over and over I think is wasted overhead.
I would say that before a node is determined to be bad (which is permanent currently), timeouts should definitely be in place. There are always transient events, and they often occur close together.
But in my opinion, if the node is determined bad by, say, more than 5 other nodes, then the node manager should restart the node with a new xor address OR reduce the number of nodes being run on the machine by removing that now-bad node. This would require messages sent to the "bad" node telling it it is bad, so the node manager (or the node itself) can do a restart.
So no, once bad it's not worth giving it another chance; for good actors, let it restart at a new xor address.
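A quick sketch of what that node-manager behaviour could look like - the types and the restart hook are hypothetical, and the threshold of 5 reports is taken from the post above, not from any real implementation:

```rust
/// Hypothetical node-manager view of one managed node.
struct ManagedNode {
    xor_address: u64, // stand-in for a real XOR name / keypair
    shun_reports: u32,
}

impl ManagedNode {
    fn record_shun_report(&mut self) {
        self.shun_reports += 1;
    }

    /// Restart once more than 5 distinct peers have reported the node as bad.
    fn needs_restart(&self) -> bool {
        self.shun_reports > 5
    }

    fn restart_with_new_identity(&mut self) {
        // A real implementation would generate a fresh keypair; here we
        // just fake a new address and clear the report counter.
        self.xor_address = self.xor_address.wrapping_add(1);
        self.shun_reports = 0;
    }
}

fn main() {
    let mut node = ManagedNode { xor_address: 42, shun_reports: 6 };
    if node.needs_restart() {
        node.restart_with_new_identity();
    }
    println!("address: {}, reports: {}", node.xor_address, node.shun_reports);
}
```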
This is probably indicative of failed GETs or so. There's a couple of angles here…
Yeah, at the moment we allow for some of this, but the events aren't expiring unless we have more events, so adding that in may smooth things over here…
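If the events did expire, it might look something like this sliding-window sketch - the window size and threshold are made up, and this is not the node's actual event tracking:

```rust
use std::time::{Duration, Instant};

/// Only failures seen within a sliding window count towards marking a
/// peer bad, so a transient burst eventually "smooths over".
struct ExpiringStrikes {
    window: Duration,
    threshold: usize,
    events: Vec<Instant>,
}

impl ExpiringStrikes {
    fn new(window: Duration, threshold: usize) -> Self {
        Self { window, threshold, events: Vec::new() }
    }

    fn record(&mut self) {
        let now = Instant::now();
        self.events.push(now);
        // Drop events older than the window.
        self.events.retain(|t| now.duration_since(*t) <= self.window);
    }

    fn over_threshold(&self) -> bool {
        self.events.len() >= self.threshold
    }
}

fn main() {
    // Hypothetical 10-minute window and 5-event threshold.
    let mut strikes = ExpiringStrikes::new(Duration::from_secs(600), 5);
    strikes.record();
    println!("shun? {}", strikes.over_threshold());
}
```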
Brainstorming here: perhaps taking a cue from today's human model, nodes that get shunned receive a "jail" term and must remediate. This is where PoR (Proof of Resource) testing should make a comeback - a node that is "jailed" must run a diagnostics suite to remediate.
The above concept might work something like this:
Shunned and now jailed node - get-out-of-jail remediation, rejoin.
IF pass:
a. the PoR test routine run by the shunned node sends out a gossip "pass" message to the other nodes in the close group
b. the close group nodes shunning the "jailed" node update its state from "jailed" to "trusted", where a minus 1 is applied to reduce the reputation class of the jailed node?
Jail time consensus:
The gossip message from the shunning nodes is time-stamped and compared among close group members in a list; the time stamp value is then averaged and stored by all close group members for the previously jailed node?
The "jail term" for a first-time offense is the time difference between the start of the "jail sentence" for the offending node and the time the jailed node spends running the PoR routine, plus some initial delay; it is sent out by the offending node via the PoR routine, which cannot be manipulated by the node owner.
IF the PoR routine is not passed, a "not passed PoR" gossip message is posted to the log of the jailed node,
AND, before it can rejoin its close group, it must run a more extensive level-2 set of PoR diagnostics and wait longer to join back, with more detailed PoR diagnostics written to the jailed node's logs to help the node owner fix the problem, likely with some advice on what the fix might be.
Some thoughts on this "how to fix the problem":
Reduce node count,
Run fewer competing processes on system
Run existing processes competing with node boot at lower priority
Increase Network Bandwidth
etc…
This area requires a whole set of test cases with pass/fail metrics, which get embodied in the PoR diagnostics.
The next time the same previously jailed and now remediated node is found by the close group to be a repeat offender:
The PoR diagnostics add more test cases (edge cases), which take more time, and the length of the "sentence" or jail time - the wait-to-join period - is extended beyond the first offense, with more remedies suggested by the extended PoR diagnostics in the log. The node owner uses these to address the edge-case conditions that might be contributing to the node's behaviour that caused it to be shunned and jailed in the first place by a consensus of the close group member nodes.
Since nodes can act on their own, they can choose, based on their owner's "tuned" or configured criteria, to continue to shun what they see as an offending node OR to go with the consensus and move the node to a "grey" list, giving it a lower reputation commensurate with the offence. It might be that, say, trying a double spend bans you forever, whereas being slow to respond or being offline for extended periods is given more leniency?
Anyway, drawing parallels to the human model of "jail time" and remediation may give us all some food for thought on how we might address this set of challenges. PoR diagnostics might hold part of the solution: they keep node owners on their toes to configure appropriately, but also educate them in the process through a trusted, agnostic PoR diagnostics feedback loop, so they can remediate the shunned/offending and fairly jailed node (penalty commensurate with the offence) quickly and proactively, based on informative log info.
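To make the jail/PoR idea a bit more concrete, here is a very rough Rust sketch of the per-peer state machine it implies. Everything here - the state names, the doubling sentence, the reputation hit, the PoR pass/fail hook - is invented for illustration; none of it exists in the current node code:

```rust
/// Hypothetical per-peer state as seen by a shunning node.
#[derive(Debug)]
enum PeerState {
    Trusted { reputation: i32 },
    Jailed { offences: u32, sentence_secs: u64 },
}

fn on_shunned(offences: u32) -> PeerState {
    // Each repeat offence doubles the jail term (first term: 10 minutes).
    PeerState::Jailed {
        offences,
        sentence_secs: 600 * 2u64.pow(offences.saturating_sub(1)),
    }
}

fn on_por_result(state: PeerState, passed: bool) -> PeerState {
    match state {
        PeerState::Jailed { offences, .. } if passed => {
            // Rejoin, but with reputation knocked down by one per offence.
            PeerState::Trusted { reputation: -(offences as i32) }
        }
        PeerState::Jailed { offences, sentence_secs } => {
            // Failed diagnostics: longer sentence, more detailed tests next time.
            PeerState::Jailed { offences, sentence_secs: sentence_secs * 2 }
        }
        trusted => trusted,
    }
}

fn main() {
    let jailed = on_shunned(2);
    println!("{:?}", jailed);
    println!("{:?}", on_por_result(jailed, true));
}
```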
You could implement a timeout for the shunning period, and increment the timeout if it happens to a node again.
You could call the feature Shuncrement
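A minimal "Shuncrement" sketch along those lines, assuming a doubling timeout per repeat offence (the base timeout and growth factor are arbitrary choices, not anything in the node code):

```rust
use std::time::Duration;

/// Shun timeout that doubles with each repeat offence instead of
/// shunning permanently.
fn shun_timeout(offence_count: u32) -> Duration {
    let base = Duration::from_secs(60);
    base * 2u32.saturating_pow(offence_count.saturating_sub(1))
}

fn main() {
    for offence in 1..=5 {
        println!("offence {offence}: shunned for {:?}", shun_timeout(offence));
    }
}
```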