Node Manager UX and Issues

Rarely, but often enough to ask about, this happens. It appears randomly, out of nowhere, and then it moves on and starts the next one just fine.
image

2 Likes

It's probably trying to connect to the RPC service too quickly, before it has been initialised.
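If that is the cause, the usual workaround is to retry the connection with a growing delay rather than failing on the first attempt. The sketch below is only an illustration of that idea with plain TCP sockets; `wait_for_port` and its parameters are hypothetical names, not part of safenode-manager, and a successful TCP connect does not prove the RPC service itself is fully initialised.

```python
import socket
import time


def wait_for_port(host, port, attempts=5, initial_delay=0.5, timeout=2.0):
    """Try to open a TCP connection, backing off between failed attempts.

    Returns True as soon as one connection succeeds, False if all attempts fail.
    All names and defaults here are illustrative assumptions.
    """
    delay = initial_delay
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Sleep only if another attempt will follow, doubling the delay.
            if attempt < attempts - 1:
                time.sleep(delay)
                delay *= 2
    return False
```

With the defaults, a service that needs a few seconds to come up would be reached on a later attempt instead of producing an error on the first one.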

5 Likes

Does anyone else often get stuck on 'refreshing the node registry' when trying to stop / reset / do anything else with safenode-manager?

So far I've been sorting this by reinstalling everything :smile: ... I'm hoping there's a better solution, so input would be much appreciated!

Edit: I've been advised it can take 30 minutes plus to get past this, but give it some time & hopefully it'll sort itself out. I will give patience a shot!

I believe that's due to the number of nodes running. It should take a reasonable time and be manageable for a couple of dozen nodes, but safenode-manager wasn't designed for hundreds of nodes.

How many did you have at the time?

1 Like

The registry refresh should not take too long, so something is happening here that shouldn't be. I don't think I've seen any reports about this before.

I think there would be two possible causes for this problem. As a first possibility, the refresh connects to each node's RPC service. That service might be dead, or otherwise taking a long time to respond. We might need to put an explicit timeout on the connection. For the second possibility, the node manager attempts to determine if the process launched by the service is still running, and it does so using an external crate called sysinfo. The node manager calls a function on this crate which refreshes the system. It could be possible this would cause a stall, but I'd imagine this is doing the same kind of thing that something like top/htop does, so that shouldn't be a particularly expensive operation.

I am still on holiday just now, but, if you encounter this again, I would see if you can try and determine if the node RPC services are still running. You can get the port numbers from the node_registry.json file, and maybe just try using netcat or something to see if the port is still responding.
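As a rough alternative to netcat, the check can be scripted. This is a minimal sketch, assuming you have already copied the RPC port numbers out of node_registry.json by hand (its layout is not shown in this thread, so the script does not try to parse it). `probe_ports` is a hypothetical helper name. The explicit timeout means a dead or hung port cannot stall the whole loop.

```python
import socket


def probe_ports(ports, host="127.0.0.1", timeout=2.0):
    """Return a dict mapping each port to True if it accepts a TCP connection.

    The timeout bounds how long a single unresponsive port can hold things up.
    """
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections as well as timeouts.
            results[port] = False
    return results
```

Calling `probe_ports([...])` with your own port list and printing the result shows which services are not answering. Note that an accepted connection only shows the port is open; a service could still be slow to answer actual RPC requests.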

1 Like

320 nodes, so yeah, this seems to be the issue. It did work eventually (~30-40 minutes perhaps), but I might have been quicker to rebuild the system.

I was restarting the nodes because, after running for quite a while, memory and CPU usage were growing significantly.

After restarting, it was all back to a lower level, so it may be that restarting nodes every day or two helps with resource management... though ideally this won't be necessary in time as the software improves.

2 Likes

That is a lot of nodes, but I still don't think it should take that long. I often test the node manager on a small Vagrant box with 50 nodes, and even there, the refresh is only taking something on the order of 3 to 5 seconds.

Even with a large number of nodes, if it's taking minutes, something is wrong.

2 Likes

I don't think this is scaling linearly... I'm not running more than about 100 nodes, and around there it starts to really take time...

2 Likes

I understand, but intuitively, it doesn't seem like there is a good reason for it to not scale linearly. It's a sequential process, and at least on paper, each step of that process doesn't really do very much. So there might be other things at play here that cause it to go non-linear. Putting some more logging into the system might help to identify what the bottleneck really is.
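To illustrate the kind of logging meant here, this is a generic sketch of timing each step of a sequential per-node loop. It is not the node manager's code: `timed_refresh`, `slowest`, and the step names used with them are hypothetical, and the real steps would be whatever the refresh actually does for each node.

```python
import time


def timed_refresh(nodes, steps):
    """Run every step for every node in sequence, recording how long each takes.

    `steps` maps a step name to a callable that takes one node.
    Returns a list of (node, step_name, seconds) tuples.
    """
    timings = []
    for node in nodes:
        for name, step in steps.items():
            start = time.perf_counter()
            step(node)
            timings.append((node, name, time.perf_counter() - start))
    return timings


def slowest(timings, n=5):
    """Return the n longest-running (node, step_name, seconds) entries."""
    return sorted(timings, key=lambda t: t[2], reverse=True)[:n]
```

If one node or one step dominates the list returned by `slowest`, that points at a single stalled service rather than a general scaling problem.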

3 Likes

320 nodes: from experience, that is going to take a long time. node-manager takes 5-10 seconds (some systems are faster, some slower) to refresh around 15-25 nodes. More importantly, the spike in CPU usage will slow the other nodes down even more, in a cascading effect.

But I agree that half an hour suggests, as Chriso says, that some node's processing is causing significant delays.

Once the CPU spikes to 100% doing the RPC calls, linear is out the window on the 50th floor, and 320 nodes is going to do that for sure. But I doubt that alone explains half an hour; more likely, as you say, some node(s) are holding up the show.

For such a low-CPU program, the RPC calls use a HUGE amount of CPU in comparison, spiking CPU/thread usage to 100% for a significant amount of processing time.
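For reference, here is what a purely linear extrapolation of the figures quoted in this thread gives for 320 nodes. This is only arithmetic on those quoted numbers (50 nodes in 3 seconds at the fast end, 15 nodes in 10 seconds at the slow end); `linear_estimate` is a hypothetical helper name.

```python
def linear_estimate(seconds, nodes, target_nodes):
    """Scale a measured refresh time linearly to a different node count."""
    return seconds / nodes * target_nodes


# Fastest and slowest per-node rates quoted in this thread.
fast = linear_estimate(3, 50, 320)    # 50 nodes in 3 s
slow = linear_estimate(10, 15, 320)   # 15 nodes in 10 s
print(f"{fast:.0f} s to {slow:.0f} s")  # prints "19 s to 213 s"
```

Even the slowest quoted rate comes to roughly three and a half minutes, so a 30-40 minute refresh is about ten times worse than linear scaling would predict.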

4 Likes

Windows 11 machine.
Scenario:
15 nodes running 24/7 for the last three days, using --home-network, all nodes added with --auto-restart.
Last night the PC crashed (not because of safenode or any of the safe services; the GPU drivers were obsolete).
After powering the PC back on, only 7 nodes started up.
image
The other nodes cannot be started with safenode-manager.
I tried stopping all nodes and starting them again, but when starting with --interval 30000 only 7 nodes start; the others give an error:

1 Like

Resolved:
Interestingly, Defender (which is switched off) on my machine blocked only 8 safenodes after the initial scan upon restart, declaring them trojans, while the other 7 were okay according to Microsoft :smiley:

4 Likes

You said it was resolved. How so?

Two words: Microsoft Defender (I have the paid version), so I am unable to kill it.

Any chance we can set log levels for the restart, @chriso? :crossed_fingers:

1 Like

Hey, sorry, could you clarify this please? Is this another setting that is not being retained on an upgrade?

I might just be slow, can we now set how many logs to keep via node-manager?

You said that you can make this available, perhaps you have and I just didn't realize.

Hey, sorry, it's not available yet. Will try and do it ASAP!

3 Likes
safenode1: The PID of the process was not found after starting it.
✕ safenode2: The PID of the process was not found after starting it.
✕ safenode3: The PID of the process was not found after starting it.
✕ safenode4: The PID of the process was not found after starting it.
✕ safenode5: The PID of the process was not found after starting it.
✕ safenode6: The PID of the process was not found after starting it.
✕ safenode7: The PID of the process was not found after starting it.
✕ safenode8: The PID of the process was not found after starting it.
✕ safenode9: The PID of the process was not found after starting it.
✕ safenode10: The PID of the process was not found after starting it.
Error: 
   0: Failed to start one or more services

Location:
   sn_node_manager/src/cmd/node.rs:759

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

I definitely had nodes running successfully on my laptop, but since the laptop is closed so often, my nodes never had a chance to earn. It seems like a real bummer that a closed laptop can't be useful.

I decided to make another attempt with the updated network to get my slightly outdated desktop iMac running nodes.

Does anyone know how to solve this, so we can get more diverse everyday computers helping run this network?

1 Like

Closing your laptop puts it to sleep. Nothing can run while it is asleep, just as you cannot build things or be conscious while asleep.

It's a fundamental issue with the way sleep mode works. If nodes could keep running, it wouldn't be sleep mode, would it? :wink:

BTW, doing a stop first and then a start might help with those errors.

1 Like