On rare occasions, but often enough to ask about, this happens. It comes randomly out of nowhere, and then it moves on and starts the next one just fine.
It's probably trying to connect to the RPC service too quickly, before it has been initialised.
Does anyone else often get stuck on "refreshing the node registry" when trying to stop, reset, or do anything else with safenode-manager?
So far I've been sorting this by reinstalling everything... I'm hoping there's a better solution, so input would be much appreciated!
Edit: I've been advised it can take 30 minutes or more to get past this, so give it some time and hopefully it'll sort itself out. I will give patience a shot!
I believe that's due to the number of nodes running. It should take a reasonable, manageable time for a couple of dozen nodes, but safenode-manager wasn't designed for hundreds of nodes.
How many did you have at the time?
The registry refresh should not take too long, so something is happening here that shouldn't be. I don't think I've seen any reports about this before.
I think there are two possible causes for this problem. First, the refresh connects to each node's RPC service. That service might be dead, or otherwise taking a long time to respond. We might need to put an explicit timeout on the connection. Second, the node manager attempts to determine whether the process launched by the service is still running, and it does so using an external crate called sysinfo. The node manager calls a function on this crate which refreshes the system. It is possible this could cause a stall, but I'd imagine it is doing the same kind of thing that something like top/htop does, so that shouldn't be a particularly expensive operation.
I am still on holiday just now, but if you encounter this again, I would try to determine whether the node RPC services are still running. You can get the port numbers from the node_registry.json file, and maybe just try using netcat or something to see if the port is still responding.
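The port check suggested above could be scripted rather than done by hand with netcat. The following is a minimal sketch, assuming the RPC ports have already been copied out of node_registry.json into a list and that the services listen on localhost. The function name check_rpc_ports and the example port numbers are hypothetical and not part of safenode-manager. It only tests whether a TCP connection is accepted within a timeout, not whether the RPC service itself answers correctly.

```python
import socket
import time


def check_rpc_ports(ports, host="127.0.0.1", timeout=2.0):
    """Try a TCP connect to each port; return {port: (ok, seconds_taken)}."""
    results = {}
    for port in ports:
        started = time.monotonic()
        try:
            # create_connection raises OSError (timeouts included) on failure.
            with socket.create_connection((host, port), timeout=timeout):
                ok = True
        except OSError:
            ok = False
        results[port] = (ok, time.monotonic() - started)
    return results


if __name__ == "__main__":
    # Hypothetical ports; replace with the values from node_registry.json.
    for port, (ok, took) in check_rpc_ports([12001, 12002]).items():
        state = "responding" if ok else "NOT responding"
        print(f"port {port}: {state} ({took:.2f}s)")
```

A port that takes the full timeout to fail, rather than being refused immediately, would be a hint that a service is hanging rather than dead.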
320 nodes, so yeah, this seems to be the issue. It did work eventually (~30-40 minutes perhaps), but it might have been quicker to rebuild the system.
I was restarting the nodes because, after running for quite a while, their memory and CPU usage were growing significantly.
After restarting, it was all back to a lower level, so it may be that restarting nodes every day or two helps with resource management... though ideally this won't be necessary in time as the software improves.
That is a lot of nodes, but I still don't think it should take that long. I often test the node manager on a small Vagrant box with 50 nodes, and even there, the refresh only takes something on the order of 3 to 5 seconds.
Even with a large number of nodes, if it's taking minutes, something is wrong.
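To put rough numbers on that, here is a back-of-the-envelope sketch that scales the 50-node figure up to 320 nodes. It assumes refresh time is strictly proportional to node count, which is exactly the assumption being questioned in this thread, so treat the result as a lower-bound expectation rather than a prediction. The function name is made up for illustration.

```python
def extrapolate_linear(measured_nodes, measured_seconds, target_nodes):
    """Scale a measured refresh time to another node count, assuming linearity."""
    return measured_seconds * target_nodes / measured_nodes


low = extrapolate_linear(50, 3, 320)   # 3 s for 50 nodes
high = extrapolate_linear(50, 5, 320)  # 5 s for 50 nodes
print(f"{low:.0f}-{high:.0f} seconds")  # prints: 19-32 seconds
```

Under that assumption, 320 nodes would refresh in well under a minute, so a 30-40 minute refresh is roughly two orders of magnitude off.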
I don't think this is scaling linearly... I'm not running more than about 100 nodes, and at about that point it starts to really take time...
I understand, but intuitively, it doesn't seem like there is a good reason for it not to scale linearly. It's a sequential process, and at least on paper, each step of that process doesn't really do very much. So there might be other things at play here that cause it to go non-linear. Putting some more logging into the system might help to identify what the bottleneck really is.
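One way to add that kind of logging is to time each node's step individually, so that a single slow node stands out from a uniformly slow loop. This is only a sketch, not the node manager's code: refresh_one stands in for whatever per-node work the refresh does, and all names here are hypothetical.

```python
import logging
import time

log = logging.getLogger("registry_refresh")


def timed_refresh(node_names, refresh_one, slow_threshold=1.0):
    """Run refresh_one(name) sequentially and record how long each call takes."""
    timings = {}
    for name in node_names:
        started = time.perf_counter()
        refresh_one(name)
        elapsed = time.perf_counter() - started
        timings[name] = elapsed
        # Slow steps are logged as warnings so they are visible by default.
        level = logging.WARNING if elapsed > slow_threshold else logging.DEBUG
        log.log(level, "refresh of %s took %.3fs", name, elapsed)
    return timings
```

If the per-node times are all similar but grow with the node count, the process is going non-linear as a whole. If one or two nodes dominate, those nodes are the ones holding things up.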
320 nodes: from experience, that is going to take a long time. The node manager takes 5-10 seconds (some systems are faster or slower) to refresh around 15-25 nodes. More importantly, the spike in CPU usage will slow the nodes down even further, in a rather cascading effect.
But I agree that a 1/2 hour looks, as Chriso says, like some node's processing is causing significant delays.
Once the CPU spikes to 100% doing the RPC calls, linear is out the window on the 50th floor, and 320 nodes is going to do that for sure. But I doubt that accounts for a 1/2 hour; more likely, as you say, some node(s) are holding up the show.
For such a low-CPU-usage program, the RPC calls are a HUGE CPU cost in comparison, spiking CPU/thread usage to 100% for a significant amount of processing time.
Windows 11 machine.
Scenario:
15 nodes working 24/7 for the last three days, using --home-network, all nodes added with --auto-restart.
Last night the PC crashed (not because of safenode or any of the safe services; the GPU drivers were obsolete).
After powering the PC back on, only 7 nodes started up.
The other nodes cannot be started with safenode-manager.
Tried to stop all nodes and start them again, but when starting with --interval 30000 only 7 nodes start; the others give this error:
Resolved:
Interestingly, Defender (which is switched off) on my machine blocked only 8 safenodes after the initial scan upon restart (declaring them trojans), but the other 7 were okay according to Microsoft.
You said it was resolved. How so?
Two words: Microsoft Defender (I have the paid version, so I'm unable to kill it).
Any chance we can set log levels for the restart, @chriso?
Hey, sorry, could you clarify this please? Is this another setting that is not being retained on an upgrade?
I might just be slow, but can we now set how many logs to keep via node-manager?
You said that you can make this available; perhaps you have and I just didn't realize.
Hey, sorry, it's not available yet. Will try and do it ASAP!
safenode1: The PID of the process was not found after starting it.
safenode2: The PID of the process was not found after starting it.
safenode3: The PID of the process was not found after starting it.
safenode4: The PID of the process was not found after starting it.
safenode5: The PID of the process was not found after starting it.
safenode6: The PID of the process was not found after starting it.
safenode7: The PID of the process was not found after starting it.
safenode8: The PID of the process was not found after starting it.
safenode9: The PID of the process was not found after starting it.
safenode10: The PID of the process was not found after starting it.
Error:
0: Failed to start one or more services
Location:
sn_node_manager/src/cmd/node.rs:759
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
I definitely had nodes running successfully on my laptop, but since the laptop is closed so often, my nodes never had a chance to earn. It seems like a real bummer that a closed laptop can't be useful.
I decided to make another attempt with the updated network to get my slightly outdated desktop iMac running nodes.
Does anyone know how to solve this, so we can get more diverse everyday computers helping run this network?
Closing your laptop puts it to sleep. Nothing can be running while it is asleep, just like you cannot build things or be conscious while you are asleep.
It's a fundamental issue with the way sleep mode works. If nodes could keep running, it wouldn't be sleep mode, would it?
BTW, doing a stop first and then a start might help with those errors.