Node Manager UX and Issues

Shu · March 29, 2024, 9:59pm

Thanks for the clarification here. So you're suggesting not to pass --peer or --network-contacts-url to safenode-manager, and simply to default to a safenode built with the network-contacts feature, which would bootstrap automatically if no --peer or --network-contacts-url argument is provided. Is that right?

For many months now I have relied on the SAFE_PEER env var, so I may have fallen out of date on what folks do as the bare minimum to join a testnet. It seems they simply run ‘safenode’ as is (a release binary built with the network-contacts feature), and if they're happy with the defaults, it does the needful without any extra arguments.

I'm all good with that suggestion. With so many options earlier, I wasn’t sure what would remain, or what would still work when passing certain info between safenode-manager and safenode. Thanks!

chriso · March 29, 2024, 10:07pm

Yeah, that’s right. With the network-contacts feature enabled, if you run safenode and don’t specify a peer, it will download the peers from a file that is hosted on S3. I still need to get clarity on this, but my current understanding is that the Early Technical Beta testnet will remain up, so the peer list for it will always be the same.

However, if we had another testnet running in parallel, we would need to start using --peer for that testnet, because there is only one contacts file. Or we might need to look into baking the contacts in at build time, or something like that.
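To make the fallback concrete, here is a rough Python sketch of how a node might pick its bootstrap peers. It is not safenode's actual code: the precedence (explicit --peer values, then the SAFE_PEER env var, then the network contacts file) is an assumption for illustration, and `resolve_bootstrap_peers`, `fetch_contacts` and `CONTACTS_URL` are hypothetical names standing in for the real S3 download.

```python
import os
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical placeholder; the real contacts file location is not given here.
CONTACTS_URL = "https://example.invalid/network-contacts"


def resolve_bootstrap_peers(
    cli_peers: Sequence[str],
    fetch_contacts: Callable[[str], list[str]],
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Pick bootstrap peers, only falling back to the contacts file when
    nothing was supplied explicitly (assumed precedence, for illustration)."""
    env = os.environ if env is None else env
    # 1. Explicit --peer arguments win.
    if cli_peers:
        return list(cli_peers)
    # 2. Otherwise use SAFE_PEER, if it is set and non-empty.
    safe_peer = env.get("SAFE_PEER", "").strip()
    if safe_peer:
        return [safe_peer]
    # 3. Finally, download the single shared contacts list. Every node that
    #    reaches this branch bootstraps into whatever network it describes.
    return fetch_contacts(CONTACTS_URL)
```

The third branch is why a second, parallel testnet would need --peer: with only one contacts file, any node that falls through to it joins whichever network that file lists.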

Having said all that though, I know that you are a much more advanced user, so you might have your own good reasons for still using --peer with the node manager.

Msafe · March 30, 2024, 10:19am

safeup node-manager
*  Installing safenode-manager  *
Installing safenode-manager for x86_64-unknown-linux-musl at /root/.local/bin…
Retrieving latest version for safenode-manager…
Installing safenode-manager version 0.7.2…
Error: Release binary https://sn-node-manager.s3.eu-west-2.amazonaws.com/safenode-manager-0.7.2-x86_64-unknown-linux-musl.tar.gz was not found

Location:
src/install.rs:324:24

safeup --version

safeup 0.7.0

I think safeup is broken as an install option.

chriso · March 30, 2024, 10:37am

Yeah, apologies, a few people pointed this out in the ETB thread. Unfortunately the 0.7.2 release never went up correctly, but I put in a PR to fix it. I may actually just upload the binaries manually.

Shu · March 30, 2024, 8:10pm

@chriso - I started using safenode-manager in an attempt to repair or restart the safenode services that still have 0 connected peers once running.

Longer term, another state representing DISCONNECTED (when connected peers are exactly 0), or INITIALIZING etc., might be useful to see in the --json output, but maybe that’s not the responsibility of the manager.

I just feel we could use more enum states to track the health of a safenode process across its life cycle: ADDED → RUNNING → INITIALIZING → FETCHING → READY → DISCONNECTED → STOPPED, etc.
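As a rough illustration of the idea, here is a Python sketch that keeps the service state and the node's own health separate and derives the latter from the status table. All names are hypothetical; ADDED and RUNNING appear in the status output in this thread, STOPPED is assumed, and FETCHING is left out because it can't be inferred from a peer count alone.

```python
from enum import Enum
from typing import Optional


class ServiceStatus(Enum):
    """Service-level states, as listed in the status table (STOPPED assumed)."""
    ADDED = "ADDED"
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"


class NodeHealth(Enum):
    """Hypothetical node-level states: a subset of the proposed life cycle."""
    ADDED = "ADDED"
    INITIALIZING = "INITIALIZING"
    READY = "READY"
    DISCONNECTED = "DISCONNECTED"
    STOPPED = "STOPPED"


def derive_health(status: ServiceStatus, connected_peers: Optional[int]) -> NodeHealth:
    """Combine the service state with the node's connected peer count.

    connected_peers is None when the node hasn't reported yet (e.g. its RPC
    endpoint isn't answering). This mapping is an assumption, not existing
    node manager behaviour.
    """
    if status is ServiceStatus.ADDED:
        return NodeHealth.ADDED
    if status is ServiceStatus.STOPPED:
        return NodeHealth.STOPPED
    # The service is running, so look at what the node itself reports.
    if connected_peers is None:
        return NodeHealth.INITIALIZING
    if connected_peers == 0:
        return NodeHealth.DISCONNECTED
    return NodeHealth.READY
```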

Another thing I noticed: --interval 30000 (or some other value) was passed while I had, say, 50 nodes managed by safenode-manager, and the first 10 started but the rest did not (this really happened… the remaining 40 showed as ADDED only). I wasn’t watching the terminal closely, so I don’t know whether an error was printed, and I only found out that the 40 had not started through a separate ps -ef | grep safenode | wc -l command.

I went back and ran safenode-manager start --interval 30000, knowing it would refresh the registry and start the safenodes that had not been spun up yet. The problem is that it now takes 30 * 10 = 300 seconds, or 5 minutes, to step through the 10 processes it already knows are running before it makes a delayed start attempt on the remaining 40 that were only in the ADDED state.

My suggestion: if a service is already running, so the action is a no-op, why wait up to 30 seconds before moving on to the next safenode service? Skipping that wait would speed up ‘bulk’ start/stop actions while still applying the staggered delay requested by the --interval parameter wherever an actual state change is required, and the user wouldn’t need to micro-manage the command by targeting individual safenode service names.
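A minimal Python sketch of the suggestion, assuming the manager walks a list of (service name, status) pairs: already-RUNNING services are skipped without waiting, and the --interval pause is only applied between real starts. `bulk_start`, `start` and `sleep` are hypothetical names, not the node manager's implementation.

```python
import time
from typing import Callable, Iterable


def bulk_start(
    services: Iterable[tuple[str, str]],
    start: Callable[[str], None],
    interval_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Start every service that isn't already RUNNING, staggering only the
    starts that actually happen. Returns the names that were started."""
    started: list[str] = []
    for name, status in services:
        if status == "RUNNING":
            # No-op: nothing changes, so there is nothing to stagger.
            continue
        if started:
            # Pause between real starts only, not after skipped services.
            sleep(interval_ms / 1000)
        start(name)
        started.append(name)
    return started
```

With the numbers above (10 RUNNING, 40 ADDED, --interval 30000), this drops the 300 seconds spent stepping over running services; the only waits are the 39 gaps between the 40 real starts.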

Happy to hear more thoughts or feedback from you or the community here. Thanks.

chriso · March 30, 2024, 8:17pm

Yeah, I agree that would be useful, but it’s more complicated: that’s information about the node itself, rather than the service. The states we have so far represent service states–nothing about the node domain.

This sounds similar to the issue @aatonnomicc encountered. So far I don’t know how to reproduce it.

Right, yeah, I think I understand what you mean here. Good suggestion.

Shu · March 30, 2024, 8:19pm

Ha, this just happened again on my second attempt at starting all 50 nodes: the first group of 10 had started, and it attempted a start on the remaining 40:

Note: it says the PID is running, but the status shows only ‘added’:

If I run safenode-manager status again, it now picks up 3 of the 6 that were previously ADDED.

It now shows only 3 left as ADDED, even though the last PID was already running:

There is some delay here, I think, but I guess it self-heals (as best it can) on the next refresh of the registry… fascinating. Or the RPC endpoint isn’t fully up within the specified interval, or a call to that RPC times out; either way, the manager doesn’t have any extra info to update itself with?

chriso · March 30, 2024, 8:19pm

Yeah, thanks. It’s the same issue that neik has seen. If I could reproduce it I could investigate, but so far I haven’t been able to.

Shu · March 30, 2024, 9:44pm

@aatonnomicc - here’s something I wrote really quickly in PowerShell (.NET Core); it should work under PowerShell 7 on Windows and on Linux as well. I’ve kept it small rather than making it complicated or bulletproof, since ideally a more permanent fix for this bootstrap problem will come from the dev team:

PS > cat ./safe_restart.ps1

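# Capture the status table; PowerShell returns native command output as an array of lines.
$lines = @(safenode-manager status)
# Match rows like "safenode3  <52-char peer ID>  RUNNING  0": running services with no connected peers.
$servicesRegex = '^(safenode\d+)\s+\w{52}\s+(RUNNING)\s+(0)\s*$'

foreach ($line in $lines)
{
	if ($line -match $servicesRegex)
	{
		Write-Host $Matches[0]
		$name = $Matches[1]
		# Restart the service, then pause so restarts are staggered.
		safenode-manager stop --service-name $name
		safenode-manager start --service-name $name
		Start-Sleep -Seconds 30
	}
}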

For now it does the job of restarting safenodes that are RUNNING with 0 connected peers, i.e. that didn’t start up properly. One could easily wire it up as a cron job if needed, e.g. pwsh -F safe_restart.ps1

FWIW, once a safenode has connected peers > 0, I haven’t seen it drop back to 0 during the lifetime of the PID itself, but I have had to run this multiple times on the same container to get ALL existing RUNNING safenode PIDs under safenode-manager into a RUNNING state with connected peers > 0.

Note: it doesn’t try to restart services that are ADDED but not RUNNING, which might be an incorrect status update from safenode-manager (TBD), and I didn’t want to spin up more PIDs than the number expected on the machine at the moment either (although I believe the service itself won’t allow that). It would be an easy modification to the snippet above (changing the regex) to kick-start the ADDED services again via the safenode-manager binary, but I’m leaving that logic out for now. It also assumes these safe binaries are on your PATH.
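For anyone who prefers Python, here is a rough port of just the selection step from the script above, with the ADDED case added as an opt-in flag. Instead of a fixed 52-character regex it splits each row on whitespace and strips ANSI colour codes in case the output is coloured. The four-column layout (name, peer ID, status, connected peers) is assumed from the status output in this thread; `services_to_restart` is a hypothetical name, and the stop/start calls and 30-second pause are left out.

```python
import re

# ANSI colour sequences such as ESC[33m ... ESC[0m, which coloured output may contain.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
SERVICE_NAME = re.compile(r"safenode\d+")


def services_to_restart(status_output: str, include_added: bool = False) -> list[str]:
    """Return service names that are RUNNING with 0 connected peers and,
    optionally, those that are still only ADDED."""
    selected = []
    for raw in status_output.splitlines():
        fields = ANSI_ESCAPE.sub("", raw).split()
        # Data rows have exactly 4 columns; headers and banners are skipped.
        if len(fields) != 4 or not SERVICE_NAME.fullmatch(fields[0]):
            continue
        name, _peer_id, status, peers = fields
        if status == "RUNNING" and peers == "0":
            selected.append(name)
        elif include_added and status == "ADDED":
            selected.append(name)
    return selected
```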

aatonnomicc · March 30, 2024, 10:09pm

Thanks @Shu, I’ll give that a go. This has been driving me round the twist.

I’ll hopefully get to try it out tomorrow after an Easter barbecue.

I’m glad you are on the case and can hopefully get @chriso some useful information to work with.

anon26713768 · March 31, 2024, 3:49pm

@chriso
I added WinSW.exe to Users\<user>\safe and included a configuration.xml with:

<service>
    <id>NodeService</id>
    <name>Node Service</name>
    <description>Autonomi service.</description>
    <executable>C:\Users\kyte7\safe\safenode-manager</executable>
</service>

I installed and then started the service.

Now, when trying to start the node, I get this:

PS C:\Users\kyte7\safe> C:\Users\kyte7\safe\safenode-manager start
=================================================
             Start Safenode Services
=================================================
Refreshing the node registry...
Attempting to start safenode4...
Failed to start 1 service(s):
✕ safenode4: [SC] StartService FAILED 1053:

The service did not respond to the start or control request in a timely fashion.


Error:
   0: Failed to start one or more services

Location:
   sn_node_manager\src\cmd\node.rs:421

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

What am I doing wrong? I assume mostly everything; I don’t have a clue what I’m doing.

chriso · March 31, 2024, 4:36pm

Hey Josh,

You don’t actually need to use WinSW, you just need to make it available somewhere that is on PATH. The node manager will then use it to generate service definitions.

So you can remove the configuration file–the node manager will generate one of those for each service.

The usage of the node manager is otherwise the same on Windows. So, add, then start etc.

anon26713768 · March 31, 2024, 6:27pm

Still getting the timeout. It looks like the --peer isn’t being added, if you look at the status.

PS C:\Windows\system32> C:\Users\kyte7\safe\safenode-manager add --count 1 --peer /ip4/144.126.194.103/udp/55644/quic-v1/p2p/12D3KooWFSYX9kwZKnsbBn263VpCkW8EvpE6DG7nBinLgF66USTT --node-port 15555 --version 0.105.2
=================================================
              Add Safenode Services
=================================================
1 service(s) to be added
Downloading safenode version 0.105.2...
Download completed: C:\Users\kyte7\AppData\Local\Temp\d3df1113-ae39-4ff0-8622-fab447431745\safenode.exe
Services Added:
 ✓ safenode1
    - Safenode path: C:\ProgramData\safenode\data\safenode1\safenode.exe
    - Data path: C:\ProgramData\safenode\data\safenode1
    - Log path: C:\ProgramData\safenode\logs\safenode1
    - RPC port: 127.0.0.1:49993
[!] Note: newly added services have not been started
PS C:\Windows\system32> C:\Users\kyte7\safe\safenode-manager start
=================================================
             Start Safenode Services
=================================================
Refreshing the node registry...
Attempting to start safenode1...
Failed to start 1 service(s):
✕ safenode1: [SC] StartService FAILED 1053:

The service did not respond to the start or control request in a timely fashion.


Error:
   0: Failed to start one or more services

Location:
   sn_node_manager\src\cmd\node.rs:421

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
PS C:\Windows\system32> C:\Users\kyte7\safe\safenode-manager status
=================================================
                Safenode Services
=================================================
Refreshing the node registry...
Service Name       Peer ID                                              Status  Connected Peers
safenode1          -                                                    ADDED               -
PS C:\Windows\system32> C:\Users\kyte7\safe\safenode-manager add --help
Add one or more safenode services.

chriso · March 31, 2024, 6:33pm

It’s not picking up WinSW and rather is attempting to use sc.exe. The node doesn’t work with that.

I’ll need to look into it in more detail, but in theory, if WinSW is in a location on the PATH, it’s supposed to be preferred over sc.exe. I’ll have a look on Tuesday when I’m back. Thanks for your efforts in the meantime.

wes · March 31, 2024, 7:55pm

I was also facing this issue on one of my windows machines. I will test out this solution early this week.

anon26713768 · March 31, 2024, 8:00pm

If you figure it out before Tuesday, please post. I have WinSW on the PATH, so that doesn’t seem to be the issue.

chriso · March 31, 2024, 8:45pm

Btw, have you confirmed this just by opening a Powershell session and typing “winsw”? In Windows, sometimes you need to log out and back in for a path change to take effect.

We use it in our integration tests and I haven’t seen any of the Windows ones fail.

The code only defaults to sc.exe if it can’t find winsw.exe: see src/kind.rs in chipsenkbeil/service-manager-rs on GitHub (commit 2f8934c4408882a692e520a59894fb6eba2988a7).
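A loose Python analogue of that fallback, handy for checking what a PATH lookup actually finds. It only mirrors the behaviour described above; it is not the service-manager-rs code, and `find_service_backend` and the exact file names searched are assumptions.

```python
import shutil
from typing import Optional


def find_service_backend(path: Optional[str] = None) -> str:
    """Return the path of a WinSW executable found on PATH (or on `path`,
    if given), otherwise fall back to "sc.exe"."""
    for name in ("winsw.exe", "winsw"):
        # shutil.which searches each directory in PATH for an executable
        # with this name (on Windows it also tries PATHEXT extensions).
        found = shutil.which(name, path=path)
        if found:
            return found
    return "sc.exe"
```

If the real lookup works by file name in a similar way, a binary still named WinSW-x64.exe (the release file name mentioned later in this thread) would presumably not match, so renaming it to winsw.exe may be worth checking.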

Shu · March 31, 2024, 8:58pm

Some services (depending on their parent process) need a full reboot to see a PATH change, whereas processes like ‘explorer.exe’ pick up the new environment variables straight away, because Windows broadcasts a notification message internally, and new processes such as ‘cmd.exe’ opened afterwards inherit them.

Process Explorer (procexp64.exe, part of the Sysinternals Suite) lets you check this: select a PID (in this case powershell.exe, cmd.exe, sc.exe, winsw.exe, safenode-manager.exe, etc.) and one of its tabs shows whether that process sees the new PATH value in its current PEB (process environment block), which it inherits from its parent or the OS when it starts.

For some services or PIDs that are already running, a low-level Windows API call like the one below may get the new environment variables recognized (possibly with a restart of the service, but without rebooting the computer); it depends on the parent service of that PID. C# code:

using System;
using System.Runtime.InteropServices;

static class EnvironmentNotifier
{
    [DllImport("user32.dll", SetLastError = true, CharSet = CharSet.Auto)]
    static extern bool SendNotifyMessage(IntPtr hWnd, uint Msg, UIntPtr wParam, string lParam);

    // Broadcast WM_SETTINGCHANGE with "Environment" to all top-level windows.
    public static void NotifyUserEnvironmentVariableChanged()
    {
        const int HWND_BROADCAST = 0xffff;
        const uint WM_SETTINGCHANGE = 0x001a;
        SendNotifyMessage((IntPtr)HWND_BROADCAST, WM_SETTINGCHANGE, UIntPtr.Zero, "Environment");
    }
}

So if Process Explorer isn’t showing the updated PATH among the running PID’s environment variables, broadcasting the change with the code above may help; if not, a reboot is ultimately required.

chriso · March 31, 2024, 9:01pm

Thanks for the info. During the integration test it doesn’t require a reboot. WinSW gets downloaded and the location is appended to the path.

anon26713768 · March 31, 2024, 11:25pm

I can’t run it via PowerShell. I’m running as administrator: if I simply enter WinSW-x64.exe, nothing happens, and if I run it with Start-Process, it pops up briefly and then closes.
I can start it via the GUI as admin. I’m lost for ideas.
