Without totally rewriting the code, it seems that you could call safenode-manager just once. The node manager output is a great resource, and re-implementing it seems counterproductive.
In the code that I’m still using:
peer_id=$(safenode-manager status --details | grep -A 5 "$dir_name - RUNNING" | grep "Peer ID:" | awk '{print $3}')
is in the for loop. I’m no bash guru by any means, but couldn’t you do something like:
node_overview=$(safenode-manager status --details)
*in for loop*
peer_id=$(echo "$node_overview" | grep -A 5 "$dir_name - RUNNING" | grep "Peer ID:" | awk '{print $3}')
[no idea if that code works, but hopefully you get the idea]
That way you’re not hitting the node status 50, 60, or 100 times on the larger node setups? Yes, the status may be a minute or two out of date by the time it gets to the last nodes, but it seems it would be much less overhead and stress on the node manager.
Nice one @wes, new version updated as you recommended. The only slight change is that it outputs the safenode-manager status to a file, then greps it for the individual details.
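For anyone else wanting to try this, here is a minimal sketch of the cached-status approach: call `safenode-manager status --details` once, save it to a file, then grep that file inside the per-node loop. Note the status output format below is a made-up sample purely for illustration; the real field layout may differ, so adjust the grep patterns to your actual output.

```shell
# Cache the status once instead of calling it per node.
status_file=$(mktemp)

# In the real script this single call replaces one call per node:
#   safenode-manager status --details > "$status_file"
# Here we fake the output so the parsing can be shown end to end.
cat > "$status_file" <<'EOF'
safenode1 - RUNNING
Version: 0.105.6
Port: 12001
RPC Port: 13001
PID: 1001
Peer ID: 12D3KooWExampleOne

safenode2 - RUNNING
Version: 0.105.6
Port: 12002
RPC Port: 13002
PID: 1002
Peer ID: 12D3KooWExampleTwo
EOF

for dir_name in safenode1 safenode2; do
  # Same grep/awk pipeline as before, but against the cached file.
  peer_id=$(grep -A 5 "$dir_name - RUNNING" "$status_file" | grep "Peer ID:" | awk '{print $3}')
  echo "$dir_name: $peer_id"
done
```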
@chriso see the discrepancy here between the versions safeup update finds and what safenode --version returns.
I get what’s going on here but doesn’t seem intuitive.
PS C:\Users\kyte7> safeup update
**************************************
*                                    *
*        Updating components         *
*                                    *
**************************************
Retrieving latest version for safe...
Latest version of safe is 0.90.4-alpha.5
There is no previous installation for the safe component.
No update will be performed.
Retrieving latest version for safenode...
Latest version of safenode is 0.105.6-alpha.4
There is no previous installation for the safenode component.
No update will be performed.
Retrieving latest version for safenode-manager...
Latest version of safenode-manager is 0.7.4-alpha.1
There is no previous installation for the safenode-manager component.
No update will be performed.
PS C:\Users\kyte7> safenode --version
safenode cli 0.105.3
PS C:\Users\kyte7> safe --version
sn_cli 0.90.2
Yeah, the semantic version specification won’t consider those alpha versions to be newer.
What I think we need to do is to incorporate the release channel into safeup, and probably the node manager too. So it would have something like a --channel argument, where you would use alpha as the value.
The “can’t run a bunch of nodes through safenode-manager” has been at least partially cracked.
The original TIG stack script had a cron job that ran every 5 minutes to pull data about the nodes. Neik has upped it to 15 minutes (I had not) and saw some improvement.
The people running “a lot of nodes” of course need a dashboard to be able to see what’s happening with the nodes / system.
The logging script had
safenode-manager status --details
inside the node loop to pull out the data. This is a non-issue for small node setups and a perfectly logical way to do it.
As the number of nodes grows, it takes longer and longer to run the loop and in my 5-minute case, it looped over itself and was running 3 times concurrently.
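One way to stop the concurrent-run pileup, independent of the caching fix: wrap the cron-driven script in a non-blocking `flock` so a slow run can't overlap the next invocation. This is a sketch with an arbitrary lock path, not the actual logging script:

```shell
# Guard a cron-driven script with flock so runs never overlap.
# The lock path is an arbitrary illustrative choice.
LOCKFILE=/tmp/node-logger.lock

(
  # Non-blocking: if a previous run still holds the lock on fd 9,
  # bail out instead of piling up concurrent runs.
  flock -n 9 || { echo "previous run still in progress, skipping"; exit 1; }

  # ... the actual per-node status/logging work would go here ...
  echo "logging run complete"
) 9>"$LOCKFILE"
```

With this in place, a run that takes longer than the cron interval simply causes the next invocation to skip, rather than three copies hammering the node manager at once.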
(the following is my assumption)
The safenode-manager status command starts off with “refreshing node registry”. I haven’t looked into the code, but I have a sneaking suspicion that the refresh includes writing. This would explain my corrupt json file. It would also explain my inability to add more than 50 nodes, because the file keeps getting overwritten by the looped “refresh” of the status command.
I had the idea to cache the status output and no idea how to implement it (not a bash guy), but @aatonnomicc to the rescue.
This gets the node registry to stop “refreshing” constantly and allow proper writes. I have successfully added nodes 51-60 into the alpha network using safenode-manager add / start.
TL;DR: If you have a lot of nodes, are having issues, and are using some form of logging / status script, make sure it’s not overusing safenode-manager status. Status is a “heavy” operation as your node count increases.
The status command does a refresh because we want to know the status of the nodes as they actually are now. The problem is, a user can interfere with things the node manager is keeping track of, by manually killing processes or services. The OS service infrastructure can also restart the services if they die, which means a new PID will be assigned to the process.
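The core of that problem is illustrated by the following sketch (my own illustration, not the node manager's actual code): a PID recorded in a registry may no longer belong to a live process, and `kill -0` is the classic way a refresh can test for that, since it sends no signal and only checks that the process still exists.

```shell
# A recorded PID can go stale: the process was killed manually, or the
# service manager restarted it under a new PID. `kill -0` tests liveness
# without actually signalling the process.
pid_alive() {
  kill -0 "$1" 2>/dev/null
}

# Our own shell is certainly alive.
pid_alive $$ && echo "pid $$ is alive"

# Start a background process, kill it, and reap it: its PID is now stale,
# which is exactly the situation a status refresh has to detect.
sleep 60 &
bgpid=$!
kill "$bgpid"
wait "$bgpid" 2>/dev/null

pid_alive "$bgpid" || echo "pid $bgpid is stale; registry entry needs refreshing"
```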
If it’s taking an excessively long amount of time, this refresh idea may have to be revisited somehow.
I don’t think that on its own it takes too long, and I fully agree that the “now” information is needed for the status.
real 0m5.907s
user 0m3.950s
sys 0m14.311s
At 50 nodes with the script running every 5 minutes the script execution time is the same as the cron time because the status call was in the loop for every node.
6sec * 50 nodes = 300 sec / 60 = 5 minutes.
My assumption is that the 15-minute cron with the old script would have topped out at ~100 nodes (the time above is just the status call, not the rest of the script).
All of that to say, I don’t think there’s an issue with the “status” run time length or the node manager implementation. I think this was a user level script that we (those running a lot of nodes) all happened to be running that was causing issues as the number of nodes on the system grew.
If we’re going to go down that route, I personally would rather see the status command take a --peer argument (or multiple --peer arguments) than have it automate its fetching scheme with some rigid grouping.
We’re designing this to run on almost anything, but there will absolutely be those of us running quite a few.
It would allow us to section off our own nodes into groups. Say an update comes out and I only want to update 10% of them to make sure they’re good before rolling out the rest of the update I can automate checking those nodes. Not a big deal when running 50 nodes, but what if I’m running 500 nodes? That’s a lot of excess querying to stat 50 nodes.
Same grouping idea for other types of network optimizations or splitting sections of nodes onto different drives.
As I’m writing this out, I’m realizing maybe what I’m asking for is a --group flag?
At the end of the day, I plan on running a lot of these. Granularity makes that easier. It just seems wasteful and like a lot of overhead to pull the status of all the nodes when I only want the status of 10% of them. Even if I manage grouping on my own, I’m still pulling a lot of extra data and causing a lot of overhead for stuff I don’t want/need.
Every bit of this is without knowing the internals of how the status subcommand works. If getting the status of all nodes is a requirement for something to work, IMO now is not the time to change it.
It wouldn’t actually be difficult to add that though, if it was something you really wanted. It just seems a bit odd to me.
I’m glad you brought it up though, because this has got me thinking about defining a scope for what the node manager is. It might be the case that we want to say that the node manager is a fairly simple utility for managing say, up to something like 500 nodes.
You sound like you are going to have some complex requirements that I’m not sure would be typical of many people. For people like you, if you wanted to run say thousands of nodes, I think I would honestly probably recommend looking into using something like Kubernetes. There you would be able to create groups and do all kinds of other complicated things.
Well, I wasn’t thinking about a cap as such, but just saying that we make clear that the node manager is something that can be used for managing tens or hundreds of nodes. Then, if you’re a really advanced user and you want to get into managing thousands, and group them by certain kinds of things like CPU usage or whatever, have a look at other existing platforms.
Not sure where you are referring to for the node number as 001, is that part of the --json output?
If so, I would disagree on the parameter being output as 001 and not 1. I wouldn’t want to add extra logic to any parsers more than necessary.
The grouping can be at a higher level of containers or VMs, where some containers or VMs are marked as staging (say 5% of the entire fleet) as opposed to production (the entire fleet minus 5%), and it’s up to the user to decide which group of safenode pids should get the update first, etc.
I think adding that kind of grouping into safenode at this time adds to the complexity without much benefit for most users.
Personally, I do wish we weren’t creating a new service per safenodeX on a host. While it offers isolation now, it’s pretty heavy on the number of services it will end up creating on the OS. One suggestion I would make here: run the safenode manager (or a variant of it) as a service that is always running on startup and that, based on a setting, spins up child pids for the safenode binaries, with the count controlled by a json setting, an RPC call, or a parameter passed into the service. This limits the query to just one endpoint, and that endpoint either queries the safenode pids that are its child processes or internally keeps track of them at a certain frequency. The user could still query an individual safenode if required, provided an RPC parameter is always passed into those services at startup from the higher-level service (though KISS applies here too, so maybe it’s not worth exposing many options; everything routes through the parent process or the safenode manager service).
Alternatively, to be honest, I am not a fan of safenodeX when managing 100s of safenode pids on a Windows or Linux host. It would be cleaner if it was systemctl start safenode@1, systemctl start safenode@2, etc., where you have only one service defined, but it takes an argument for the safenode service # to spin up, so the unit at /etc/systemd/system is defined only once and is generic. As a vendor to the OS and owner of this product, I can see folks getting upset if they have 100 safenodeX services registered on a Windows or Linux OS.
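For reference, the systemd side of that idea is a templated unit. A sketch follows; the binary path, user, and --root-dir layout here are pure assumptions for illustration, not the node manager's actual arguments or directories:

```ini
# /etc/systemd/system/safenode@.service -- hypothetical template unit
[Unit]
Description=safenode instance %i
After=network-online.target

[Service]
# %i is the instance number from `systemctl start safenode@1`, `safenode@2`, ...
# The ExecStart path, flags, and User below are illustrative assumptions.
ExecStart=/usr/local/bin/safenode --root-dir /var/safenode/node%i
User=safe
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With that single file on disk, `systemctl enable --now safenode@1 safenode@2` starts two instances; each instance is still its own running service, but the unit definition exists only once.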
I get that we are at early stages here and we have to start somewhere, so for now it’s far better than the older safeup mechanism or spinning them up manually one at a time.
While that is another reasonable approach, it’s a rewrite of the entire application. I don’t think we can afford that at this point in time.
It’s a reasonable point, but it’s just really an administration/management issue. I’m pretty sure you still actually end up with X number of services running, doing it that way. I actually did consider that for Systemd, but I’m not sure it was supported by the service-manager-rs crate. I may look into that again. On Windows, I don’t think you can get that kind of management, at least not that I’m aware of.
Yep, I agree with you, that’s a pretty significant architectural change at this time.
Not a deal breaker by any means at this time. I am okay with using what we have today.
Yes, while it will create N services or pids, the filesystem will only have one generic implementation of the service on disk.
Yes, Windows won’t support this, hence the suggestion of parent/child process management (maybe in the distant future, or maybe not at all). Up to MaidSafe to decide later on.