Node Manager UX and Issues

Rarely, but often enough to ask about, this happens. It appears randomly, out of nowhere, and then it moves on and starts the next one just fine.
[image]

It’s probably trying to connect to the RPC service too soon, before the service has been initialised.

Does anyone else often get stuck on ‘refreshing the node registry’ when trying to stop, reset, or do anything else with safenode-manager?

So far I’ve been sorting this by reinstalling everything :smile: … I’m hoping there’s a better solution, so input would be much appreciated!

Edit: I’ve been advised that it can take 30 minutes or more to get past this, and that if I give it some time it will hopefully sort itself out. I will give patience a shot!

I believe that’s due to the number of nodes running. The refresh should take a reasonable time and be manageable for a couple of dozen nodes, but safenode-manager wasn’t designed for hundreds of nodes.

How many did you have at the time?

The registry refresh should not take too long, so something is happening here that shouldn’t be. I don’t think I’ve seen any reports about this before.

I think there are two possible causes for this problem.

First, the refresh connects to each node’s RPC service. That service might be dead, or otherwise taking a long time to respond. We might need to put an explicit timeout on the connection.

Second, the node manager attempts to determine whether the process launched by the service is still running, and it does so using an external crate called sysinfo. The node manager calls a function on this crate which refreshes the system information. It’s possible this could cause a stall, but I’d imagine it is doing the same kind of thing that something like top/htop does, so it shouldn’t be a particularly expensive operation.
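To illustrate the first possibility, here is a minimal Python sketch of a sequential probe with an explicit connection timeout. It is not the node manager's code (which is written in Rust); the function names `rpc_port_responds` and `refresh_sequentially` and the 2-second default are assumptions made for illustration. The point is only that, with a timeout, one dead service costs at most roughly `timeout` seconds instead of stalling the whole loop.

```python
import socket
import time


def rpc_port_responds(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts alike.
        return False


def refresh_sequentially(endpoints, timeout: float = 2.0):
    """Probe each (host, port) pair in turn and report how long the whole pass took.

    With IP-literal hosts, the pass takes at most about len(endpoints) * timeout seconds.
    """
    results = {}
    started = time.monotonic()
    for host, port in endpoints:
        results[(host, port)] = rpc_port_responds(host, port, timeout)
    elapsed = time.monotonic() - started
    return results, elapsed
```

Under these assumptions, even 320 unresponsive services with a 2-second timeout would cap a pass at a little over ten minutes, and responsive ones would return almost immediately.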

I am still on holiday just now, but if you encounter this again, I would try to determine whether the node RPC services are still running. You can get the port numbers from the node_registry.json file, and then use netcat or something similar to see if each port is still responding.
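As a rough alternative to checking each port by hand with netcat, the check could be scripted along these lines. The layout of node_registry.json is not described in this thread, so the sketch does not rely on a particular schema: it assumes only that RPC addresses are stored as `host:port` strings under keys whose names contain `rpc`, and that assumption may not match the real file. All function names here are hypothetical.

```python
import json
import socket


def find_rpc_addresses(value):
    """Recursively collect string values stored under any key whose name contains 'rpc'."""
    found = []
    if isinstance(value, dict):
        for key, item in value.items():
            if "rpc" in key.lower() and isinstance(item, str):
                found.append(item)
            else:
                found.extend(find_rpc_addresses(item))
    elif isinstance(value, list):
        for item in value:
            found.extend(find_rpc_addresses(item))
    return found


def split_host_port(address: str):
    """Split 'host:port' into (host, port); assumes an IPv4 address or a hostname."""
    host, _, port = address.rpartition(":")
    return host, int(port)


def check_registry(path: str, timeout: float = 2.0):
    """Return {address: True/False} for every RPC address found in the registry file."""
    with open(path, encoding="utf-8") as handle:
        registry = json.load(handle)
    status = {}
    for address in find_rpc_addresses(registry):
        try:
            host, port = split_host_port(address)
        except ValueError:
            continue  # Not a host:port string; ignore it.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                status[address] = True
        except OSError:
            status[address] = False
    return status
```

Calling `check_registry` with the path to your node_registry.json and printing the entries that map to `False` would show which services, if any, have stopped answering.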

320 nodes, so yeah, this seems to be the issue. It did work eventually (after perhaps 30-40 minutes), but it might have been quicker to rebuild the system.

I was restarting the nodes because, after they had been running for quite a while, memory and CPU usage were growing significantly.

After restarting, usage was back down to a lower level, so it may be that restarting nodes every day or two helps with resource management… though ideally this won’t be necessary in time as the software improves.

That is a lot of nodes, but I still don’t think it should take that long. I often test the node manager on a small Vagrant box with 50 nodes, and even there, the refresh only takes something on the order of 3 to 5 seconds.

Even with a large number of nodes, if it’s taking minutes, something is wrong.
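To put numbers on that, here is a small worked calculation. It assumes the refresh cost per node is constant (strictly linear scaling), which is exactly the assumption being debated in this thread, and it uses only the figures quoted above: 50 nodes in 3 to 5 seconds, and 320 nodes taking roughly 30 to 40 minutes. The helper name `extrapolate_linear` is made up for the example.

```python
def extrapolate_linear(nodes_measured, seconds_measured, nodes_target):
    """Scale a measured refresh time to another node count, assuming constant cost per node."""
    return seconds_measured / nodes_measured * nodes_target


# 50 nodes in 3 to 5 seconds, scaled linearly to 320 nodes.
low = extrapolate_linear(50, 3.0, 320)   # about 19.2 seconds
high = extrapolate_linear(50, 5.0, 320)  # about 32 seconds
print(f"Linear estimate for 320 nodes: {low:.1f} to {high:.1f} seconds")

# Per-node cost implied by the reported 30 to 40 minutes for 320 nodes.
observed_low = 30 * 60 / 320   # 5.625 seconds per node
observed_high = 40 * 60 / 320  # 7.5 seconds per node
print(f"Reported refresh: {observed_low:.3f} to {observed_high:.3f} seconds per node")
```

The Vagrant measurement works out to 0.06 to 0.1 seconds per node, so the reported refresh is somewhere between about 55 and 125 times slower per node than a linear extrapolation would predict.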

I don’t think this is scaling linearly… I’m not running more than about 100 nodes, and at around that point it starts to really take time…

I understand, but intuitively, there doesn’t seem to be a good reason for it not to scale linearly. It’s a sequential process, and at least on paper, each step of that process doesn’t really do very much. So there might be other things at play here that cause it to go non-linear. Putting some more logging into the system might help to identify what the bottleneck really is.
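The kind of logging meant here could look something like the following Python sketch: time each step for each node, log it, and list the slowest entries afterwards. This is an illustration of the idea rather than anything in the node manager itself; the step names `rpc_query` and `process_check`, the service names, and the `time.sleep` calls standing in for real work are all hypothetical.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("refresh_timing")


@contextmanager
def timed(step: str, node: str, timings: dict):
    """Record how long the wrapped block takes under timings[(node, step)] and log it."""
    started = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - started
        timings[(node, step)] = elapsed
        log.info("%s %s took %.3fs", node, step, elapsed)


def slowest(timings: dict, count: int = 5):
    """Return the `count` slowest ((node, step), seconds) entries, slowest first."""
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)[:count]


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    timings = {}
    for node in ["safenode1", "safenode2", "safenode3"]:  # hypothetical service names
        with timed("rpc_query", node, timings):
            # Stand-in for the RPC call; safenode2 is made artificially slow.
            time.sleep(0.05 if node == "safenode2" else 0.001)
        with timed("process_check", node, timings):
            time.sleep(0.001)  # Stand-in for the process lookup.
    for (node, step), seconds in slowest(timings, 3):
        print(f"{node:10s} {step:14s} {seconds:.3f}s")
```

With per-step timings like these, it would be possible to tell whether the time is going into a few slow RPC connections, into the process check, or is growing steadily as the node count rises.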

320 nodes: from experience, that is going to take a long time. node-manager takes 5-10 seconds (some systems are faster or slower) to refresh around 15-25 nodes. But more importantly, the spike in CPU usage will cause more nodes to slow down even further, in a rather cascading effect.

But I agree that half an hour suggests, as Chriso says, that some node’s processing is causing significant delays.

Once the CPU spikes to 100% doing the RPC calls, linear scaling is out the window on the 50th floor, and 320 nodes is going to do that for sure. But I doubt that alone accounts for half an hour; more likely, as you say, some node(s) are holding up the show.

For such a low-CPU-usage program, the RPC calls are a HUGE CPU cost in comparison, spiking CPU/thread usage to 100% for a significant amount of processing time.
