Upgrade from safenode 0.109.0 to 0.110.0 using safenode-manager

Do we need a cheeky little topic for the upgrade? Probably not but here is one anyway in case anyone has wandered in from Discord and in case there are any issues to report.

Intended audience: those running safenodes using safenode-manager

Pre-reqs:-
Having safenodes running at version 0.109.0 using safenode-manager.

Upgrade safenode-manager with:-

safeup node-manager

To see the current version of your safenodes:-

safenode-manager status --details

To upgrade all nodes at once:-

safenode-manager upgrade

OR

To upgrade safenodes individually:-

safenode-manager upgrade --service-name safenode1

There is a burst of activity when a node restarts that lasts for a couple of minutes with some records being stored.

That makes me think it’s best to do the upgrades one at a time and let them settle down, for maximum niceness for your nodes and your network. I’m sure it’s better for the network as a whole to upgrade them one at a time if you are running a lot of nodes.
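
The one-at-a-time approach can be scripted. A minimal sketch, assuming your services are named safenode1 through safenode10 and that five minutes is enough settle time (both of those are assumptions; adjust for your own setup):

```shell
#!/bin/bash
# Upgrade safenode1..safenode10 one at a time, pausing between each
# so the post-restart burst of record transfers can settle.
# The service-name pattern and the 5-minute pause are assumptions.
for i in $(seq 1 10); do
  sudo safenode-manager upgrade --service-name "safenode$i"
  sleep 300   # give the freshly restarted node time to settle
done
```

If a node fails mid-loop you can re-run the loop; already-upgraded services are simply reported as up to date.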

So I upgraded nodes 10 down to 6 one at a time, waiting until each settled. But I upgraded nodes 1 to 5 all at once. As predicted, the little RPi4 and router network got very busy for a couple of minutes.

As each node can do several MB/s of downloads at once, I think they were being strangled when the last 5 upgraded all at once.

This graph shows the tail end of the one-at-a-time upgrades, and the point where the last 5 were starting and not getting all the bandwidth they wanted.

I got this error when I upgraded one of them:-

safenode-manager upgrade --service-name safenode9
╔═══════════════════════════════╗
β•‘   Upgrade Safenode Services   β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
Retrieving latest version of safenode...
Error: 
   0: error sending request for url (https://crates.io/api/v1/crates/sn_node)
   1: client error (Connect)
   2: dns error: failed to lookup address information: Try again
   3: failed to lookup address information: Try again

Location:
   /project/sn_node_manager/src/cmd/mod.rs:78

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

But I ran the command again and it completed the upgrade and seems fine. Although there was another error produced:-

safenode-manager upgrade --service-name safenode9
╔═══════════════════════════════╗
β•‘   Upgrade Safenode Services   β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
Retrieving latest version of safenode...
Latest version is 0.110.0
Using cached safenode version 0.110.0...
Download completed: /home/safe/.local/share/safe/node/downloads/safenode
Refreshing the node registry...
Attempting to stop safenode9...
βœ“ Service safenode9 with PID 44675 was stopped
Attempting to start safenode9...
Upgrade summary:
βœ• safenode9 was upgraded from 0.109.0 to 0.110.0 but it did not start
Error: 
   0: There was a problem upgrading one or more nodes

Location:
   /project/sn_node_manager/src/cmd/node.rs:545

Suggestion: For any services that were upgraded but did not start, you can attempt to start them again using the 'start' command.

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

After that I spotted that there is a new version of safenode-manager. Whoops!

So I updated it with safeup node-manager

Maybe it’s completely unrelated but I didn’t see the error again after that.

The upgrade worked for all of the nodes in the end. Dive in, the water is lovely!

7 Likes

Doing it without an interval set (I didn’t know beforehand that you could set one), I found that internet traffic was high for quite a while afterwards before it started to settle down.

You probably want a large interval between nodes being upgraded. For those with slower internet speeds, maybe a 20-minute interval; for those on gigabit it doesn’t need to be as long.
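
If your safenode-manager build supports an interval flag on upgrade (check `safenode-manager upgrade --help`; on the versions I’ve seen the value is in milliseconds, but verify that for yours), a 20-minute gap would look something like this sketch:

```shell
#!/bin/bash
# Hedged sketch: the --interval flag and its millisecond units are
# assumptions to confirm against `safenode-manager upgrade --help`.
# 20 minutes = 20 * 60 * 1000 = 1200000 ms
INTERVAL_MS=$((20 * 60 * 1000))
sudo safenode-manager upgrade --interval "$INTERVAL_MS"
```

On a gigabit connection you could shrink the interval considerably; the point is just to stop all the restart bursts overlapping.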

I gather the reason for the big traffic flows is that when the node starts up again after being upgraded it has a new xor address and is doing three things:

  • uploading the chunks that are now inactive in its store to any node that is now in the closest 5 nodes to that chunk and needs it.
  • downloading chunks that it is responsible for from other nodes.
  • and the normal messaging nodes do

Whereas a new node without old data (i.e. when adding nodes) is just downloading chunks on top of normal messaging. And since ISP download bandwidth is usually higher than upload, for most people not on gigabit up and down, adding nodes will be quicker than upgrading them.

So my advice is to use an interval at least as long as the one you used (or should have used) to add nodes, and the slower your upload speed, the longer the interval should be.

2 Likes

On my side, small VPSs upgrade just fine, but a 400-node dedicated server went into meltdown after 88 nodes with a 5-minute interval.


5 Likes

I’m seeing the same. Did you nuke it and restart the nodes?

2 Likes

Is that the case? I assumed the point of an upgrade was in part to avoid this, so I am curious.

My experience, in both cases just using upgrade without an interval was excellent:

  • cheapo VPS running 20 nodes. No issues. Nodes have continued earning. I didn’t check activity levels, just the vdash display, which looked fine.
  • dedicated 2-core VPS running 50 nodes. safenode-manager reported that some nodes didn’t restart. I didn’t notice at the time, but checking status just now, in fact they had all started, all look fine, and have continued earning. So @Chriso, maybe lack of an interval can lead safenode-manager to think some have not started and report an error?
1 Like

The xor address is not stored on disk, just in memory. So it’s impossible for the restarting node to know what its old xor address was; it has to pick a new one.

Also, if it did save the xor address, there would be a simple and powerful attack vector: take over control of various regions of xor space and nuke the chunks inside them. Yes, that gets harder as the network gets huge, but my humble Threadripper box could decimate the 60K-node network in a couple of weeks, or perhaps a couple of days.

For security the node has to gain a new xor address.

And one mitigation against this attack is the bad-node detection in other nodes, which would shun the restarting node for not responding while it is offline. This would also penalise any node restarting with the same xor address.

I saw higher uploading after a node restarted than the downloading done by a new node. So it would seem it was giving chunks away, and its downloading was like a new node’s.

1 Like

I was asking officially: is that from the team?

It’s from my observations and logic. You were also replying to me, so I responded.

1 Like

Thanks, but that’s not what I was asking for!

I feel a disturbance in the force. Machines I have not started to upgrade are all heading north on their load averages, network usage, disk space, and rewards.

3 Likes

I sense an unintended β€œfeature”. :grimacing:

EDIT: maybe a good feature…
If this is due to old nodes acting like a shrinking network, as prices rise clients will stop uploading and upgrade to use the new nodes?

3 Likes

When I noticed the load averages I was getting ready to hit the kill switch, thinking I was going into meltdown. But since disk and network are up, I think I’ll leave it to play out and see what happens.

@chriso, is there any time frame for needing to upgrade nodes?
I want to try and upgrade these nodes but still haven’t figured out how to go about it, apart from maybe trying a 10-minute interval on the next attempt.

Here is an example of a service definition for safenode:

[Unit]
Description=safenode1
[Service]
ExecStart=/var/safenode-manager/services/safenode1/safenode --rpc 127.0.0.1:13000 --root-dir /var/safenode-manager/services/safenode1 --log-output-dest /var/log/safenode/safenode1 --log-format json --port 39214 --metrics-server-port 14000 --owner maidsafe --peer /ip4/142.93.45.95/udp/51702/quic-v1/p2p/12D3KooWCqhBgvNApaTLPryvNB6dJcRue2DCcQo3RjKCYh3QHSXi
Restart=on-failure

Notice the Restart=on-failure part. I think what happens here is: the service fails to start for some reason (most likely because it can’t connect to the RPC service), but then systemd tries to start it again, and that time it runs. The next time you run status, it will report the service as running.
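
To double-check what systemd actually did after a reported failure, you can query the unit directly. A hedged sketch (safenode9 is just an example service name; the NRestarts property needs a reasonably recent systemd):

```shell
#!/bin/bash
# Inspect a service that safenode-manager said "did not start".
# "safenode9" is an example name; substitute your own service.
systemctl status safenode9 --no-pager

# How many times has systemd restarted it under Restart=on-failure?
systemctl show safenode9 --property=NRestarts

# Logs around the upgrade window, to see the failed first attempt
# followed by the successful retry:
journalctl -u safenode9 --since "15 minutes ago" --no-pager
```

If NRestarts is non-zero and the unit is active, that matches the theory above: the first start failed but the retry succeeded.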

3 Likes

There aren’t any hard limits here, no. Just upgrade at your leisure.

I think you are running 400 nodes? I’m not even sure this is an issue with the node manager per se; it just seems difficult to have that many nodes on one machine trying to upgrade in fairly quick succession.

We only actually have 25 nodes on our own droplets. I did try and ask if we could look into setups with a large number of nodes on one machine, but was told it’s just not a priority for us right now. So, if you’re taking a chance on putting hundreds of nodes on one machine, there’s not too much advice I can give just now. Perhaps you could just use --service-name on the upgrade command and gradually do batches of them?

7 Likes

Those machines going north on everything are not upgrading; they are still steady as she goes on the old version.

Yes, I might try a loop doing one node at a time with a large interval and report back if I have some luck.

2 Likes

Have you not been able to get any of the 400 nodes upgraded?

No, I test-ran an upgrade on one machine last night with a 5-minute interval and went into full meltdown after 88 nodes. Some screenshots are further up this thread.

So I’m now back at the drawing board over what to do next. I was thinking today I’ll try it again with a 10-minute interval and see if that sorts the problem, but it would mean a few days to upgrade.

To be honest, given the state of play just now, and with a lot of nodes on one machine, you might just be looking at taking quite a bit of time to upgrade them. I’d probably go for the batched --service-name approach.

3 Likes

I’m doing them in batches. I had 400 nodes, all stopped.

I have two batch jobs running. One does the upgrade with --do-not-start and the other actually starts the nodes:

safe@wave1-bigbox:~$ for i in {320..400}; do sudo /home/safe/.local/bin/safenode-manager upgrade --do-not-start --service-name safenode$i; sleep 570; done &
safe@wave1-bigbox:~$ for i in {117..150}; do sudo /home/safe/.local/bin/safenode-manager start --service-name safenode$i  ;sleep 301;  done

I have the upgrade job running ahead of the start job. This is keeping my load average within sensible limits and so far it seems to be going well.
Yes, it’s slow, but not as slow as having to start again because the load average went mental and the box crawled to a halt.

3 Likes

Confirmed. I don’t have graphs for CPU but this is the switch port with the RPi4 with 10 nodes.

The blip just before midnight yesterday was the upgrade. Network usage went down. Then something happened around 1400 today and network usage went up.

I’ve seen the network has shrunk by about a third according to https://network-size.autonomi.space. Can’t remember whose excellent work that was.

I’m personally not seeing an increase in earnings. In fact, I’ve not had any since the upgrade. But then 10 nodes is a small sample size.