Announcement: Preparing for Today’s Network Upgrade

rusty.spork · November 7, 2024, 8:34pm

Announcement: Preparing for Today’s Network Upgrade

Please follow these instructions to upgrade your nodes to the newest version to ensure the best performance and stability!

For Node Launchpad Users:

Open Node Launchpad v0.4.4
Press O to access the Options screen.
Then, press Ctrl + U, and hit Enter. This will upgrade your nodes. Upgrading can take several minutes for each node. Please don’t close the app or stop your nodes during this process.
Your nodes will now stop
Press Ctrl + S to start your nodes again

For CLI Tool Users:

If you’re using the CLI tool, please upgrade by running this command:
safenode-manager upgrade --interval 60000

For ALL Users:

Please start your nodes gradually — especially if you plan on running multiple nodes.
Be conservative with CPU allocation to maintain stability
Uploads have already resumed

Binary Versions:

nat-detection: v0.2.10
node-launchpad: v0.4.3
autonomi: v0.1.4
safenode: v0.112.3
safenodemand: v0.11.2
safenode-manager: v0.11.2

storage_guy · November 7, 2024, 9:28pm

Had an error doing the first upgrade:-

safenode-manager upgrade --service-name safenode1
╔═══════════════════════════════╗
║   Upgrade Safenode Services   ║
╚═══════════════════════════════╝
Retrieving latest version of safenode...
Latest version is 0.112.3
Downloading safenode version 0.112.3...
Download completed: /home/ubuntu/.local/share/safe/node/downloads/safenode
Refreshing the node registry...
Attempting to stop safenode1...
✓ Service safenode1 with PID 13307 was stopped
Attempting to start safenode1...

Upgrade summary:
✕ safenode1 was upgraded from 0.112.2 to 0.112.3 but it did not start
Error: 
   0: There was a problem upgrading one or more nodes

Location:
   /project/sn_node_manager/src/cmd/node.rs:604

It’s 1 node that runs on an AWS Instance for testing things first. Not the most powerful of CPUs but it is fine for 1 node. CPU was only at about 30% while doing the upgrade.

However, for a couple of mins after this message was produced it was still doing the validation of records (is that what it is or discarding them?) that produces these messages in the log:-

[2024-11-07T21:19:09.663174Z INFO sn_networking::record_store] Failed to decrypt record from file "41ffe05807368068bd0d46fd09184cea7f6dcfdfd584c97645dba3c83c7a2f70", clean it up.
[2024-11-07T21:19:09.845007Z ERROR sn_networking::record_store] Error while decrypting record. key: Key(b"`Z\x8ad~\xd6`tV/^\xe3>\xa1\xe2\x93\x15k\xf8\xd4R\x15\xf3\x95\xcf\x9dT\x9a>J\xa3\x10"): Error
[2024-11-07T21:19:09.8

And then I see the node has actually started and is producing normal messages in the log.

safenode-manager status shows it as being RUNNING and there are peers and says it is at 0.112.3.

And it’s done a lot of PUTs and GETs already.

So watch out for this! You might think an upgrade has failed but it hasn’t really.

chriso · November 7, 2024, 10:16pm

This is something that happens sometimes when upgrading, particularly if you have quite a few nodes–one of them might fail to start after obtaining the new version. The restart policy on the service definition is for it to try again on failure, so you will generally find it will ‘heal’ itself shortly after it failed to start.

The error message is also trying to indicate that the upgrade itself has been OK, but it’s the subsequent attempt to start that is the problem.

chriso · November 7, 2024, 10:36pm

Also, Qi informed me that those messages in the log are to be expected. They are in alignment with one of the changes we made in this release.

storage_guy · November 7, 2024, 10:43pm

Thank you for that @chriso.

There is something else I am confused about. When it became clear there was going to be an upgrade I stopped and reset my 3 home nodes. But I’d misinterpreted what was going on and didn’t realise it would be an upgrade so no need to start new nodes. So yesterday afternoon I started up 3 nodes again. Now when I try to upgrade it says I’m already on 0.112.3 version:-

Retrieving latest version of safenode...
Latest version is 0.112.3
Using cached safenode version 0.112.3...
Download completed: /home/safe/.local/share/safe/node/downloads/safenode
Refreshing the node registry...
✓ All nodes are at the latest version

And no upgrade happens.

The start time of first node was:-
[2024-11-06T17:56:40.582970Z

I started a 4th node which didn’t need to download a new version.

So did I jump the gun by about 24 hours?! But it still worked?

Now that the new node has settled down I can’t see much difference with the nodes in terms of CPU and RAM.

So I think I was on the new version 24 hours early!

I’ve not earned anything from these nodes (not desperately worried - it’s a small sample size) but when I do I suppose that will be the proof that all is well.

neo · November 7, 2024, 11:29pm

Can we have a detailed summary of the process happening here.

how exactly is the upgrade happening? Is the node program stopped, new version (already downloaded) started? OR is there some sort of in memory upgrade going on?
- I guess its the normal stop all nodes downloaded new version
What happens to the peerID, does it change? IE a new ID and XOR address
What happens to the record_store? How is it decrypted since the in-memory keys for decrypting would evaporate when the node is stopped?

These are very important questions since the peerID is supposed to change now. And important in knowing how data is retained on the network if all nodes are stopped prior to the upgrade.

@rusty.spork can you please get good answers to these simple questions with the associated reasoning for asking them so the answers can address the reasons as well.

chriso · November 7, 2024, 11:31pm

Yeah, the release has been available for a while. We just didn’t give the go-ahead to upgrade because we were trying to get some things together with the launchpad.

chriso · November 7, 2024, 11:34pm

For the upgrade, the node stops, the binary is replaced, then it is started again.

In the case of the upgrade, the peer ID is retained.

neo · November 7, 2024, 11:41pm

So we are ignoring David saying it should change and is a huge attack vector. I can exploit this on day 1 to own portions of the network to cause trouble if I wanted (which I don’t) but still its an attack vector that David recognised. And having chunks held by more than the 5 closest to help doesn’t solve it since the attack is more than just owning records

Where is the record_store decrypt key held then, must be on disk or can be determined from things like secret_key??

chriso · November 7, 2024, 11:42pm

It isn’t being ignored. We are aware of it.

neo · November 7, 2024, 11:44pm

Any answer to the other question?

chriso · November 7, 2024, 11:45pm

Sorry, I don’t have all the answers all of the time. I would need someone else to come in on that one.

tobbetj · November 7, 2024, 11:58pm

Can that be written to the get started guide or similar? Also is it possible to make nodes check if cpu usage is above 50% more than 5 times during 5 minutes.

When nodes being stopped is it ok to later try and start them again or is it better to reset nodes? Does it matter how long a node has been stopped when choosing to try and start some, will stopped nodes be seen as bad by the network when they start again?

neo · November 8, 2024, 12:12am

Subsequent question and maybe related to @tobbetj question

If the node is started again for instance 1/2 hour or so later, then because it has the same peerID will it then be in a state of being completely shunned by the peers it had before being stopped for update. 1/2 hour should be long enough for all the 200 (approx) peers to have shunned it for non-communicating.

EDIT: If this is right then maybe even 6 minutes later there could be a lot of shunning of that node and this in itself will cause large problems and potential segmenting with islands of nodes joined by a few nodes not shunning either island, and many times routing cannot reach some “islands” from other places in the XOR space

dirvine · November 8, 2024, 9:59am

Yes a few things in play here and some context, things to think about

To get the keys attackers need to get in the machine (not readable remotely via any api etc.)
If attackers in machine then they get whatever they wish regardless.
The identity_key is just node name, it cannot do anything with your wallet or any data etc.
In total network fail nodes need to restart and start serving data (no node shining during them all off line etc. but it’s more subtle)
Wallet keys are seperate an unrelated to identity keys
Users can start nodes fresh so all new keys etc,

So some issues to deal with

Data on disk for nodes should not be human readable, therefor nodes encrypt this (using an algo based on secret key). This prevents injection of malice attack where folk bypass the client API and store “bad” data, whoever that is deemed to be in what jurisdiction.
If a node restarts all it’s data is essentially gone (unless it uses same id key)
If upgrade or crash of an individual node, the effect is trivial
If above is network wide, the effect is data loss

So some context, there’s a lot can be done in any of these scenarios, but having nodes with id keys able to do no network manipulation was a big step.

Having reusable identity keys is a security hole, but of what size and cost is not so simple to see. It’s also many layers of security cna be in place here and some OS’s make network intrusion and breaking much harder to do these days. However is such intrusion does happen, reusable identity keys will be the least of your worries really. If nodes behave bad they will be shunned and so on.

All of these are issues to consider and we cannot take a single angle of it’s 100% OK or it’s 100% doomed as none o fates angles are correct.

neo · November 8, 2024, 12:01pm

The one immediate concern I have though is the last post. And I understand if this is too large a problem to fixed in the near future, but seems from past experience with these large nodes (& large max chunk size) that we see nodes joining a small network and not joining in the whole network as reported by some people after a collapse.

launchpad stops the nodes to do upgrades
the user has to restart the new upgraded nodes
the nodes when started will have the same peerID
the user may have gone to get coffee or any other thing
the user comes back some time later (could easily be over 10 minutes to hours)
the user starts the nodes again gaining the same peerID etc
BUT the other nodes in the network have marked those nodes as bad, due to non-responsiveness, and refuse to talk to the nodes when started by the user.
the only nodes the user’s nodes can talk to are new ones that do not know about the old nodes.
This can cause segmentation and/or incomplete understanding of the close nodes and these nodes seeing a much smaller network will assume they are responsible for a larger range of xor addresses.
and island effects and instability in the network as a whole if too many people have this happen.

This is my immediate concern, over any other attack vectors I can mention if anyone is interested.

EDIT: see this post for the solution and should be easy enough to implement without touching the node code. Announcement: Preparing for Today’s Network Upgrade - #25 by neo

Josh · November 8, 2024, 12:16pm

I think in the past a node would keep all records even those it was no longer responsible for.

It would cleanup once full.

Is that correct and still the case?

@qi_ma maybe.

peca · November 8, 2024, 3:01pm

I agree, it is a dangerous scenario.

Maybe we could make shunning temporary instead of permanent. Removing nodes from shun-list after 24h would help integrating “isolated islands” back into network after blackouts, updates, etc. It could be self-healing mechanism of the network, but I am not sure how much it would weaken the protection against nodes that are truly bad or malicious.

qi_ma · November 8, 2024, 5:07pm

It’s called pruning, and now will be triggered much earlier, 10% of the max_records as I remember.

Josh · November 8, 2024, 5:08pm

Thank you @qi_ma

Topic		Replies	Views
BasicEconomyTweaks [Early Technical Beta] Part Deux [Offline] Releases	488	3436	May 6, 2024
Step-by-step: the road to Fleming, 5: Network upgrades Development development , fleming , routing	44	3305	May 19, 2019
Step-by-step: the road to Fleming, 1: The Big Questions: SAFE Fleming and Beyond Development development , fleming , routing	30	3338	July 11, 2019
Nov 14 Network Upgrade Updates	22	391	November 28, 2024
Node Upgrade Alert! Time to Update Again! Updates beta	84	890	September 22, 2024

Announcement: Preparing for Today’s Network Upgrade

Related topics