Node Manager UX and Issues

But are those fluctuations in what the node manager is reporting, or in the values from the logs? Because what we’re saying is: if we change the node manager to report what’s in the routing table, we wouldn’t expect the fluctuations any more.

Ahh yes, OK, the routing table is very stable.

Still confused as to why those 2 nodes have no rewards; I see very few errors in the logs.
But as Qi says, 7 is low.

If other nodes think these 2 are bad for some reason, that’s 2 out of 29 nodes seemingly being shunned.

Would I see any indication that they do? It would be nice if my logs said: you were reported as bad for this reason.

Yeah, we could have that notification sent, so that a node can know what others are thinking about it.
Good for node owners.

While we’re talking about adding things to RPC and/or the status --json command, I have a small wish list. Take it or leave it; just putting it out there, trying to get a full picture of node data over time.

  • Current records “responsible for” (not sure how that’s phrased, but it’s being calculated for store-cost)
  • Total records for node (Cred. @neo )
  • Current store cost for a chunk (currently in logs)
  • Current wallet balance (Has RPC call, would love it in nodemanager status)
  • CPU Usage of node (currently in logs)
  • Mem usage (Currently in logs)

Bonus for:

  • Network load of the node itself

Included in that, it’d be great to have a total down/up count of data (we’d need uptime as well, but that should be there anyhow).

Current records is also referred to as active records. And for that matter, inactive records would be good too, as that is not always equal to total minus active, e.g. during early node operations.

Is the XOR address there as well?

There is a figure called inactive nodes.

Active + inactive != total nodes. They can be equal, but not always, especially when few churns have happened involving the node.

We really need all three if we want to be complete and cover everything.

Thank you @qi_ma

warn!("Peer {detected_by:?} consider us as BAD, due to {bad_behaviour:?}.");

Just to clarify: how many nodes need to agree that a node is bad before it is shunned?

As you cannot recover from being labelled bad, how hard is it to be expelled?

A majority of the close_group nodes.
And each one has to accumulate multiple issues (bad_quote, can’t fetch copy, chunk_proof verification failure) of the same kind within a certain period before it considers the peer BAD.
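
For illustration only, here’s a rough Rust sketch of that per-kind accumulation; the type and field names are hypothetical, not taken from the actual node code, and the real implementation would also expire issues outside the tracking window:

```rust
use std::collections::HashMap;

// Hypothetical issue categories, mirroring the kinds mentioned above.
#[derive(Hash, PartialEq, Eq, Clone, Copy, Debug)]
enum IssueKind {
    BadQuote,
    FailedCopyFetch,
    ChunkProofFailure,
}

// Hypothetical per-peer issue tracker: counts issues per (peer, kind).
struct IssueTracker {
    counts: HashMap<(String, IssueKind), u32>,
    // Number of same-kind issues before the peer is considered BAD.
    threshold: u32,
}

impl IssueTracker {
    fn new(threshold: u32) -> Self {
        Self { counts: HashMap::new(), threshold }
    }

    // Record one issue and report whether this peer now counts as BAD.
    fn record(&mut self, peer: &str, kind: IssueKind) -> bool {
        let count = self.counts.entry((peer.to_string(), kind)).or_insert(0);
        *count += 1;
        *count >= self.threshold
    }
}
```

With a threshold of three, two bad quotes plus one chunk-proof failure would not mark the peer as bad, matching the “same category” rule described here.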

Can you clarify, please?

ONE report of a bad quote and ONE report of a chunk_proof verification failure for the same address will NOT result in shunning? It needs to be two or more of the same category?

No

YES.
Actually, it shall be three or more of the same category.

I view it, @Josh, as a bad node being shunned one by one by its neighbours (maybe just close nodes) as each of those neighbours decides to shun it.

As I understand it, the shunning is done by each node independently, based on what it sees and on asking other neighbours, close and further away (according to what was said in here).

So it’s not so much how many nodes are needed before it is shunned, but how many nodes need to shun it before it cannot operate effectively in the network. Effectiveness (connectivity to the network) will degrade one by one as more nodes decide to shun it.

@Josh Your question suggested consensus between nodes to do the shunning, with the responses feeding into that. But my understanding from reading everything here is that nodes decide independently to shun a bad node.

Yes, I am going in circles.

I also thought it was independent, but then started wondering why nodes report other nodes as bad.

Why report the status to other nodes if it has no impact on decisions?

One way to think of it is how humans in a nice world interact as neighbours (i.e. the Scottish countryside).

If you have an unruly neighbour and another neighbour asks you about them then you would tell them that. That other neighbour would take your report into account when deciding how they will interact with the “unruly” neighbour.

So either the reported status was asked for, or the node was being neighbourly, warning of the unruly node.

It’s not consensus where they all talk to each other and vote on the status of the potentially unruly neighbour. But it is sharing observations with other neighbours. It’s actually a huge difference.

** I used “unruly” here because it’s a better descriptor when talking about humans: “not following the rules”.

This is the way.

It does both things.

  1. Close nodes can be assessed directly.
  2. Far nodes cannot, so we intermittently ask their close nodes for their opinion. When we reach a threshold of folk saying “dude is bad”, we decide to opt out.

It’s not consensus in the cryptographic / ordered sense. It’s eventually consistent across the network.

Ah, @neo has another great analogy there too!
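
A minimal sketch of step 2’s local decision, assuming each answer gathered from a far peer’s close nodes is boiled down to a yes/no “is it bad” opinion (the function name and threshold handling are made up for illustration, not the real node code):

```rust
/// Decide locally whether to shun a far peer, given opinions gathered
/// from that peer's close nodes. Each node runs this on its own; there
/// is no vote or coordination round, which is why the network is only
/// eventually consistent about a peer's status.
fn should_shun(opinions: &[bool], threshold: usize) -> bool {
    opinions.iter().filter(|&&is_bad| is_bad).count() >= threshold
}
```

For example, with a threshold of 3, three out of five “bad” answers is enough for this node to opt out of talking to the peer, regardless of what any other node decides.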

Just out of interest, what would happen if I ran safenode-manager upgrade on my alpha nodes? Is there something to prevent that, or do they just get shunned? Because in my head I see them thinking that they are responsible for data that is from another network.

Just a small gripe of mine, Chris: if we could reduce keystrokes, safenode-manager is a lot to type.

Alias is good on Linux.

ping @chriso

would it be a PITA to have --interval on the upgrade command as well?

ubuntu@s13:~$ sudo env "PATH=$PATH" safenode-manager upgrade -h
Upgrade safenode services

Usage: safenode-manager upgrade [OPTIONS]

Options:
      --do-not-start                 Set this flag to upgrade the nodes without automatically starting them
      --env <env>                    Provide environment variables for the safenode service
      --force                        Set this flag to force the upgrade command to replace binaries without comparing any version numbers
      --peer-id <PEER_ID>            The peer ID of the service to upgrade
      --service-name <SERVICE_NAME>  The name of the service to upgrade
      --url <URL>                    Provide a binary to upgrade to using a URL
      --version <VERSION>            Upgrade to a specific version rather than the latest version
  -h, --help                         Print help (see more with '--help')

I tried to upgrade another machine, and I started getting errors on the later nodes in the upgrade process; the processor was maxed out.

✓ safenode1 upgraded from 0.105.2 to 0.105.6
✓ safenode2 upgraded from 0.105.2 to 0.105.6
✓ safenode3 upgraded from 0.105.2 to 0.105.6
✓ safenode4 upgraded from 0.105.2 to 0.105.6
✓ safenode5 upgraded from 0.105.2 to 0.105.6
✓ safenode6 upgraded from 0.105.2 to 0.105.6
✓ safenode7 upgraded from 0.105.2 to 0.105.6
✓ safenode8 upgraded from 0.105.2 to 0.105.6
✓ safenode9 upgraded from 0.105.2 to 0.105.6
✓ safenode10 upgraded from 0.105.2 to 0.105.6
✓ safenode11 upgraded from 0.105.2 to 0.105.6
✓ safenode12 upgraded from 0.105.2 to 0.105.6
✓ safenode13 upgraded from 0.105.2 to 0.105.6
✓ safenode14 upgraded from 0.105.2 to 0.105.6
✓ safenode15 upgraded from 0.105.2 to 0.105.6
✓ safenode16 upgraded from 0.105.2 to 0.105.6
✓ safenode17 upgraded from 0.105.2 to 0.105.6
✓ safenode18 upgraded from 0.105.2 to 0.105.6
✓ safenode19 upgraded from 0.105.2 to 0.105.6
✓ safenode20 upgraded from 0.105.2 to 0.105.6
✓ safenode21 upgraded from 0.105.2 to 0.105.6
✓ safenode22 upgraded from 0.105.2 to 0.105.6
✓ safenode23 upgraded from 0.105.2 to 0.105.6
✓ safenode24 upgraded from 0.105.2 to 0.105.6
✓ safenode25 upgraded from 0.105.2 to 0.105.6
✕ safenode26 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:46563'
✕ safenode27 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:34315'
✓ safenode28 upgraded from 0.105.2 to 0.105.6
✓ safenode29 upgraded from 0.105.2 to 0.105.6
✕ safenode30 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:45593'
✓ safenode31 upgraded from 0.105.2 to 0.105.6
✕ safenode32 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:39499'
✓ safenode33 upgraded from 0.105.2 to 0.105.6
✕ safenode34 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:40425'
✓ safenode35 upgraded from 0.105.2 to 0.105.6
✕ safenode36 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:41639'
✕ safenode37 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:37865'
✕ safenode38 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:41659'
✓ safenode39 upgraded from 0.105.2 to 0.105.6
✓ safenode40 upgraded from 0.105.2 to 0.105.6
✕ safenode41 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:34799'
✕ safenode42 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:46689'
✕ safenode43 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:44065'
✕ safenode44 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:43509'
✕ safenode45 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:41283'
✕ safenode46 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:43777'
✕ safenode47 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:43419'
✓ safenode48 upgraded from 0.105.2 to 0.105.6
✕ safenode49 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:42031'
✕ safenode50 was not upgraded: Error: Could not connect to RPC endpoint 'https://127.0.0.1:38275'

That’s fine, I’ll get that added.

Do you have the output from before the summary? It would be interesting to see if the nodes are not starting or stopping correctly.

The output from safenode-manager status?

I can run it again on another machine, but I think the end result will be another borked machine, as I can no longer run safenode-manager status on that machine.

I meant from the upgrade command. You’ve only pasted the summary. If you had the previous output, that would have been useful.