Without totally rewriting the code, it seems that you could call safenode-manager just once. The node manager output is a great resource, and re-implementing it seems counterproductive.
In the code that I’m still using:
peer_id=$(safenode-manager status --details | grep -A 5 "$dir_name - RUNNING" | grep "Peer ID:" | awk '{print $3}')
is in the for loop. I’m no bash guru by any means, but couldn’t you do something like:
node_overview=$(safenode-manager status --details)
*in for loop*
peer_id=$(echo "$node_overview" | grep -A 5 "$dir_name - RUNNING" | grep "Peer ID:" | awk '{print $3}')
[no idea if that code works, but hopefully you get the idea]
That way you’re not hitting the node status 50, 60, or 100 times on the larger node setups? Yes, the status may be a minute or two out of date by the time it gets to the last nodes, but it seems it would be much less overhead and stress on the node manager.
Nice one @wes, new version updated as you recommended. The only slight change is that it outputs the safenode-manager status to a file, then greps it for the individual details.
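For anyone else wanting to try this, here is a minimal sketch of the cached-status approach: call `safenode-manager status --details` once, save it to a file, then grep that file inside the per-node loop. Note the status output format below is a made-up sample purely for illustration; the real field layout may differ, so adjust the grep patterns to your actual output.

```shell
# Cache the status once instead of calling it per node.
status_file=$(mktemp)

# In the real script this single call replaces one call per node:
#   safenode-manager status --details > "$status_file"
# Here we fake the output so the parsing can be shown end to end.
cat > "$status_file" <<'EOF'
safenode1 - RUNNING
Version: 0.105.6
Port: 12001
RPC Port: 13001
PID: 1001
Peer ID: 12D3KooWExampleOne

safenode2 - RUNNING
Version: 0.105.6
Port: 12002
RPC Port: 13002
PID: 1002
Peer ID: 12D3KooWExampleTwo
EOF

for dir_name in safenode1 safenode2; do
  # Same grep/awk pipeline as before, but against the cached file.
  peer_id=$(grep -A 5 "$dir_name - RUNNING" "$status_file" | grep "Peer ID:" | awk '{print $3}')
  echo "$dir_name: $peer_id"
done
```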
@chriso see the discrepancy here between the versions safeup update finds and what safenode --version returns.
I get what’s going on here but doesn’t seem intuitive.
PS C:\Users\kyte7> safeup update
**************************************
*                                    *
*        Updating components         *
*                                    *
**************************************
Retrieving latest version for safe...
Latest version of safe is 0.90.4-alpha.5
There is no previous installation for the safe component.
No update will be performed.
Retrieving latest version for safenode...
Latest version of safenode is 0.105.6-alpha.4
There is no previous installation for the safenode component.
No update will be performed.
Retrieving latest version for safenode-manager...
Latest version of safenode-manager is 0.7.4-alpha.1
There is no previous installation for the safenode-manager component.
No update will be performed.
PS C:\Users\kyte7> safenode --version
safenode cli 0.105.3
PS C:\Users\kyte7> safe --version
sn_cli 0.90.2
Yeah, the semantic version specification won’t consider those alpha versions to be newer.
What I think we need to do is to incorporate the release channel into safeup, and probably the node manager too. So it would have something like a --channel argument, where you would use alpha as the value.
The “can’t run a bunch of nodes through safenode-manager” has been at least partially cracked.
The original TIG stack script had a cron job that ran every 5 minutes to pull data about the nodes. Neik has upped it to 15 minutes (I had not) and saw some improvement.
The people running “a lot of nodes” of course need a dashboard to be able to see what’s happening with the nodes / system.
The logging script had
safenode-manager status --details
inside the node loop to pull out the data. This is a non-issue for small node setups and a perfectly logical way to do it.
As the number of nodes grows, it takes longer and longer to run the loop and in my 5-minute case, it looped over itself and was running 3 times concurrently.
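One way to stop the concurrent-run pileup, independent of the caching fix: wrap the cron-driven script in a non-blocking `flock` so a slow run can't overlap the next invocation. This is a sketch with an arbitrary lock path, not the actual logging script:

```shell
# Guard a cron-driven script with flock so runs never overlap.
# The lock path is an arbitrary illustrative choice.
LOCKFILE=/tmp/node-logger.lock

(
  # Non-blocking: if a previous run still holds the lock on fd 9,
  # bail out instead of piling up concurrent runs.
  flock -n 9 || { echo "previous run still in progress, skipping"; exit 1; }

  # ... the actual per-node status/logging work would go here ...
  echo "logging run complete"
) 9>"$LOCKFILE"
```

With this in place, a run that takes longer than the cron interval simply causes the next invocation to skip, rather than three copies hammering the node manager at once.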
(the following is my assumption)
The safenode-manager status command starts off with “refreshing node registry”. I haven’t looked into the code, but I have a sneaking suspicion that the refresh includes writing. This would explain my corrupt json file. It would also explain my inability to add more than 50 nodes, because the file keeps getting overwritten by the looped “refresh” of the status command.
I had the idea to cache the status output and no idea how to implement it (not a bash guy), but @aatonnomicc to the rescue.
This gets the node registry to stop “refreshing” constantly and allow proper writes. I have successfully added nodes 51-60 into the alpha network using safenode-manager add / start.
TL;DR: If you have a lot of nodes, are having issues, and are using some form of logging / status script, make sure it’s not overusing safenode-manager status. Status is a “heavy” operation as your node count increases.
The status command does a refresh because we want to know the status of the nodes as they actually are now. The problem is, a user can interfere with things the node manager is keeping track of, by manually killing processes or services. The OS service infrastructure can also restart the services if they die, which means a new PID will be assigned to the process.
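The core of that problem is illustrated by the following sketch (my own illustration, not the node manager's actual code): a PID recorded in a registry may no longer belong to a live process, and `kill -0` is the classic way a refresh can test for that, since it sends no signal and only checks that the process still exists.

```shell
# A recorded PID can go stale: the process was killed manually, or the
# service manager restarted it under a new PID. `kill -0` tests liveness
# without actually signalling the process.
pid_alive() {
  kill -0 "$1" 2>/dev/null
}

# Our own shell is certainly alive.
pid_alive $$ && echo "pid $$ is alive"

# Start a background process, kill it, and reap it: its PID is now stale,
# which is exactly the situation a status refresh has to detect.
sleep 60 &
bgpid=$!
kill "$bgpid"
wait "$bgpid" 2>/dev/null

pid_alive "$bgpid" || echo "pid $bgpid is stale; registry entry needs refreshing"
```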
If it’s taking an excessively long amount of time, this refresh idea may have to be revisited somehow.
I don’t think that on its own it takes too long, and I fully agree that the “now” information is needed for the status.
real 0m5.907s
user 0m3.950s
sys 0m14.311s
At 50 nodes with the script running every 5 minutes the script execution time is the same as the cron time because the status call was in the loop for every node.
6sec * 50 nodes = 300 sec / 60 = 5 minutes.
My assumption is that the 15-minute cron with the old script would have topped out at ~100 nodes (the time above is just the status call, not the rest of the script).
All of that to say, I don’t think there’s an issue with the “status” run time length or the node manager implementation. I think this was a user level script that we (those running a lot of nodes) all happened to be running that was causing issues as the number of nodes on the system grew.
If we’re going to go down that route, I personally would rather see the status command take a --peer argument (or multiple --peer arguments) than have it automate its fetching scheme with some rigid grouping.
We’re designing this to run on almost anything, but there will absolutely be those of us running quite a few.
It would allow us to section off our own nodes into groups. Say an update comes out and I only want to update 10% of them to make sure they’re good before rolling out the rest of the update I can automate checking those nodes. Not a big deal when running 50 nodes, but what if I’m running 500 nodes? That’s a lot of excess querying to stat 50 nodes.
Same grouping idea for other types of network optimizations or splitting sections of nodes onto different drives.
As I’m writing this out, I’m realizing maybe what I’m asking for is a --group flag?
At the end of the day, I plan on running a lot of these. Granularity makes that easier. It just seems wasteful and like a lot of overhead to pull the status of all the nodes when I only want the status of 10% of them. Even if I manage grouping on my own, I’m still pulling a lot of extra data and causing a lot of overhead for stuff I don’t want/need.
Every bit of this is without knowing the internals of how the status subcommand works. If getting the status of all nodes is a requirement for something to work, IMO now is not the time to change it.
It wouldn’t actually be difficult to add that though, if it was something you really wanted. It just seems a bit odd to me.
I’m glad you brought it up though, because this has got me thinking about defining a scope for what the node manager is. It might be the case that we want to say that the node manager is a fairly simple utility for managing say, up to something like 500 nodes.
You sound like you are going to have some complex requirements that I’m not sure would be typical of many people. For people like you, if you wanted to run say thousands of nodes, I think I would honestly probably recommend looking into using something like Kubernetes. There you would be able to create groups and do all kinds of other complicated things.
Well, I wasn’t thinking about a cap as such, but just saying that we make clear that the node manager is something that can be used for managing tens or hundreds of nodes. Then, if you’re a really advanced user and you want to get into managing thousands, and group them by certain kinds of things like CPU usage or whatever, have a look at other existing platforms.
Not sure where you are referring to for the node number as 001, is that part of the --json output?
If so, I would disagree on the parameter being output as 001 and not 1. I wouldn’t want to add extra logic to any parsers more than necessary.
The grouping can be at a higher level of containers or VMs, where some containers or VMs are marked as staging (say 5% of the entire fleet) as opposed to production (the entire fleet minus 5%), and it’s up to the user to decide which group of safenode pids should get the update first, etc.
I think adding that kind of grouping into safenode at this time adds to the complexity without much benefit for most users.
Personally, I do wish we weren’t creating a new service per safenodeX on a host. While it offers isolation now, it’s pretty heavy on the number of services it will end up creating on the OS. One suggestion I would make here: run the safenode manager (or a variant of it) as a service that is always running on startup and that, based on a setting, spins up child pids for the safenode binaries, with the count controlled by a json setting, an RPC call, or a parameter passed into the service. This limits the query to just one endpoint, and that endpoint either queries the safenode pids that are its child processes or internally keeps track of them at a certain frequency. The user could still query an individual safenode if required, provided an RPC parameter is always passed into those services at startup from the higher-level service (though KISS applies here too, so maybe it’s not worth exposing many options; everything routes through the parent process or the safenode manager service).
Alternatively, to be honest, I am not a fan of safenodeX when managing 100s of safenode pids on a Windows or Linux host. It would be cleaner if it was systemctl start safenode@1, systemctl start safenode@2, etc., where you have only one service defined, but it takes an argument for the safenode service # to spin up, so the unit at /etc/systemd/system is defined only once and is generic. As a vendor to the OS and owner of this product, I can see folks getting upset if they have 100 safenodeX services registered on a Windows or Linux OS.
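For reference, the systemd side of that idea is a templated unit. A sketch follows; the binary path, user, and --root-dir layout here are pure assumptions for illustration, not the node manager's actual arguments or directories:

```ini
# /etc/systemd/system/safenode@.service -- hypothetical template unit
[Unit]
Description=safenode instance %i
After=network-online.target

[Service]
# %i is the instance number from `systemctl start safenode@1`, `safenode@2`, ...
# The ExecStart path, flags, and User below are illustrative assumptions.
ExecStart=/usr/local/bin/safenode --root-dir /var/safenode/node%i
User=safe
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With that single file on disk, `systemctl enable --now safenode@1 safenode@2` starts two instances; each instance is still its own running service, but the unit definition exists only once.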
I get that we are at early stages here and we have to start somewhere, so for now it’s far better than the older safeup mechanism or spinning them up manually one at a time.
While that is another reasonable approach, it’s a rewrite of the entire application. I don’t think we can afford that at this point in time.
It’s a reasonable point, but it’s just really an administration/management issue. I’m pretty sure you still actually end up with X number of services running, doing it that way. I actually did consider that for Systemd, but I’m not sure it was supported by the service-manager-rs crate. I may look into that again. On Windows, I don’t think you can get that kind of management, at least not that I’m aware of.
Yep, I agree with you, that’s a pretty significant architectural change at this time.
Not a deal breaker by any means at this time. I am okay with using what we have today.
Yes, while it will create N services or pids, the filesystem will only have one generic implementation of the service on disk.
Yes, Windows won’t support this, hence the suggestion of parent/child process management (maybe in the distant future, or maybe not at all). Up to MaidSafe to decide later on.