Thanks for the clarification here, so you are suggesting to not pass --peer or --network-contacts-url into safenode-manager, and simply default to a safenode built with network-contacts feature, which would auto bootstrap if no --peer or --network-contacts URL argument is provided. Is that right?
For the past many months I have relied on SAFE_PEER env, so I think I may have gotten out of date on what folks are doing as bare minimum to join a testnet. It seems like they were simply running ‘safenode’ as is (release binary built with network-contacts feature), and if they like the defaults, it would do the needful without any extra arguments.
I am all good with the above suggestion, as with so many options earlier, I wasn’t sure what would remain and what would still work when passing certain info between safenode-manager and safenode. Thanks!
Yeah, that’s right. With the network-contacts feature enabled, if you run safenode and don’t specify a peer, it will download the peers from a file that is hosted on S3. I still need to get clarity on this, but my current understanding is that the Early Technical Beta testnet will remain up. So the peer list will always be the same for that.
However, if we had another testnet running in parallel, we would then need to start using --peer for that testnet, because there is only one contacts file. Or we would need to maybe look into baking the contacts in at build time or something like that.
Having said all that though, I know that you are a much more advanced user, so you might have your own good reasons for still using --peer with the node manager.
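The fallback described in this reply can be sketched roughly as follows. This is only an illustration in Python, not the actual safenode code: the function name, the precedence of explicit peers over the environment variable, the comma-separated format, and the `fetch_contacts` callback standing in for the S3 download are all assumptions. The env var name is simply the one mentioned earlier in the thread.

```python
from typing import Callable, List, Mapping, Sequence


def resolve_bootstrap_peers(
    cli_peers: Sequence[str],
    env: Mapping[str, str],
    fetch_contacts: Callable[[], Sequence[str]],
    env_var: str = "SAFE_PEER",  # name as written in this thread (assumption)
) -> List[str]:
    """Return bootstrap peers, preferring explicit sources over the contacts file."""
    # 1. Peers passed explicitly (e.g. via --peer) win.
    if cli_peers:
        return list(cli_peers)
    # 2. Next, the environment variable; comma-separated is an assumption.
    from_env = [p.strip() for p in env.get(env_var, "").split(",") if p.strip()]
    if from_env:
        return from_env
    # 3. Otherwise fall back to the single hosted contacts file.
    return list(fetch_contacts())
```

Because step 3 reads one shared file, a second testnet running in parallel would have to be reached through step 1 or 2, which is the limitation pointed out above.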
Yeah, apologies, a few people pointed this out in the ETB thread. Unfortunately the 0.7.2 release never went up correctly, but I put in a PR to fix it. I may actually just upload the binaries manually.
@chriso - I started using safenode-manager in an attempt to repair or restart the safenode services that have 0 connected peers once running.
Long term, another state representing DISCONNECTED may be useful to see in the --json output when the connected peers count is exactly 0 (or INITIALIZING etc.), but maybe that’s not the responsibility of the manager.
I just feel we could use more enum states to track the health of a safenode process across its life cycle: ADDED → RUNNING → INITIALIZING → FETCHING → READY → DISCONNECTED → STOPPED, etc.
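To make the idea concrete, here is a small Python sketch of such a life cycle. It is not how the node manager models state today. The state names come from the list above; the enum name, the helper, and every transition beyond the simple left-to-right order (reconnecting, stopping from a live state, restarting) are assumptions.

```python
from enum import Enum


class NodeLifecycle(Enum):
    ADDED = "added"
    RUNNING = "running"
    INITIALIZING = "initializing"
    FETCHING = "fetching"
    READY = "ready"
    DISCONNECTED = "disconnected"
    STOPPED = "stopped"


S = NodeLifecycle
# Forward edges follow the order in the post; the extra edges are assumptions.
ALLOWED = {
    S.ADDED: {S.RUNNING},
    S.RUNNING: {S.INITIALIZING, S.STOPPED},
    S.INITIALIZING: {S.FETCHING, S.STOPPED},
    S.FETCHING: {S.READY, S.STOPPED},
    S.READY: {S.DISCONNECTED, S.STOPPED},
    S.DISCONNECTED: {S.INITIALIZING, S.STOPPED},  # reconnect or give up
    S.STOPPED: {S.RUNNING},                       # restart
}


def can_transition(current: NodeLifecycle, new: NodeLifecycle) -> bool:
    """True if moving from `current` to `new` is allowed by the table above."""
    return new in ALLOWED.get(current, set())
```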
Another thing I noticed is when --interval 30000 (or whatever value) is passed, and say I have 50 nodes under the safenode-manager: the first 10 were started, but the rest were not (this really happened… the remaining 40 showed as ADDED only). I wasn’t watching the terminal closely, so I don’t know whether an error was generated, and I only found out that the 40 had not started from a separate ps -ef | grep safenode | wc -l command.
I went back to safenode-manager and typed safenode-manager start --interval 30000, knowing it would refresh the registry and start the safenodes that hadn’t been spun up yet. The problem I noticed here is that it now takes 30 * 10 = 300 seconds, or 5 minutes, to iterate through the processes it already knows are running before making a delayed start attempt on the remaining 40 that were in an ADDED state only.
My suggestion here is: if the action is a no-op because the service is already running, why wait up to 30 seconds before moving on to the next safenode service? Skipping the wait would speed up ‘bulk’ start/stop actions, while preserving the staggered delay requested by the --interval parameter wherever an actual operation is required to alter state. That helps when the user doesn’t want to micromanage the target of the command on a per-service-name basis.
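A rough Python sketch of what I mean, purely illustrative and not the node manager's Rust code. The `(name, status)` pair shape, the `start` callback and the injectable `sleep` are all hypothetical.

```python
import time
from typing import Callable, Iterable, List, Tuple


def start_services(
    services: Iterable[Tuple[str, str]],  # (name, status) pairs; hypothetical shape
    interval_secs: float,
    start: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Start every service that is not RUNNING, pausing only between real starts."""
    started: List[str] = []
    for name, status in services:
        if status == "RUNNING":
            continue  # no-op: move on immediately, no delay
        if started:
            sleep(interval_secs)  # stagger only between actual start attempts
        start(name)
        started.append(name)
    return started
```

With 10 RUNNING and 40 ADDED services and a 30-second interval, this sleeps 39 times (1,170 seconds in total) and spends no time at all on the 10 services that are already running.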
Happy to hear more thoughts or feedback from you or the community here. Thanks.
Yeah, I agree that would be useful, but it’s more complicated: that’s information about the node itself, rather than the service. The states we have so far represent service states–nothing about the node domain.
This sounds similar to the issue @aatonnomicc encountered. So far I don’t know how to reproduce it.
Right, yeah, I think I understand what you mean here. Good suggestion.
Ha, this just happened again on 2nd attempt of starting all 50 peers, where the first group was 10 that had started, and it attempted a start on the remaining 40:
If I hit safenode-manager status, now it covered 3 out of the 6 that were previously ADDED.
Now only shows 3 left as added though the last pid was already running:
There is some delay here I think, but I guess it self-heals (as best as it can) on the next ‘refresh’ of the registry… fascinating… or the RPC endpoint isn’t fully activated within the interval specified… or a timeout against that RPC… so either way, it doesn’t have any extra info to update itself?
@aatonnomicc - here is something I wrote really quickly in PowerShell (.NET Core); it should work under PowerShell 7 on Windows and on Linux as well. I kept it small rather than overly complicated or bulletproof, since I’d rather wait for a more permanent fix to this bootstrap problem, ideally from the dev team:
It does the job for now of attempting to restart safenodes that are RUNNING but have 0 peers and didn’t start up. One could easily wire it up as a cron job if needed, i.e. pwsh -F safe_restart.ps1
FWIW, once a safenode has connected peers > 0, I haven’t seen it go back down to 0 during the lifetime of the pid itself, but I have had to run this multiple times on the same container to get ALL existing RUNNING safenode pids under safenode-manager into a state of connected peers > 0 & RUNNING.
Note: It’s not trying to restart pids in the ADDED but not RUNNING state, which might be an incorrect status update from safenode-manager (TBD), and I didn’t want to spin up more pids than the number expected on the machine at the moment either (although I believe the service itself won’t allow that). It would be an easy modification to the snippet of code above (a regex change) to kick-start the ADDED state again via the safenode-manager binary, but I’m going to leave that logic out for now. It also assumes these safe binaries are in your PATH.
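For anyone who prefers Python, the selection rule described here (RUNNING with 0 connected peers, leaving ADDED services alone) could be sketched like this. It is not the PowerShell script from this post, and the dictionary keys are hypothetical: the real field names in the `safenode-manager status --json` output may differ, so check them before relying on this.

```python
from typing import Any, Dict, Iterable, List


def services_needing_restart(nodes: Iterable[Dict[str, Any]]) -> List[str]:
    """Names of services reported as RUNNING with exactly zero connected peers."""
    names: List[str] = []
    for node in nodes:
        # "status", "connected_peers" and "service_name" are assumed key names.
        status = str(node.get("status", "")).upper()
        peers = node.get("connected_peers")
        if status == "RUNNING" and peers == 0:
            names.append(node["service_name"])
    return names
```

Services with no peer count at all (for example ones still in ADDED) are skipped on purpose, matching the note above. The returned names can then be handed to whatever restart step you use.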
PS C:\Users\kyte7\safe> C:\Users\kyte7\safe\safenode-manager start
=================================================
Start Safenode Services
=================================================
Refreshing the node registry...
Attempting to start safenode4...
Failed to start 1 service(s):
✕ safenode4: [SC] StartService FAILED 1053:
The service did not respond to the start or control request in a timely fashion.
Error:
0: Failed to start one or more services
Location:
sn_node_manager\src\cmd\node.rs:421
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
What am I doing wrong? I assume mostly everything; I don’t have a clue what I am doing.
You don’t actually need to use WinSW, you just need to make it available somewhere that is on PATH. The node manager will then use it to generate service definitions.
So you can remove the configuration file–the node manager will generate one of those for each service.
The usage of the node manager is otherwise the same on Windows. So, add, then start etc.
Still getting the timeout. It looks like the --peer is not added if you look at the status.
PS C:\Windows\system32> C:\Users\kyte7\safe\safenode-manager add --count 1 --peer /ip4/144.126.194.103/udp/55644/quic-v1/p2p/12D3KooWFSYX9kwZKnsbBn263VpCkW8EvpE6DG7nBinLgF66USTT --node-port 15555 --version 0.105.2
=================================================
Add Safenode Services
=================================================
1 service(s) to be added
Downloading safenode version 0.105.2...
Download completed: C:\Users\kyte7\AppData\Local\Temp\d3df1113-ae39-4ff0-8622-fab447431745\safenode.exe
Services Added:
✓ safenode1
- Safenode path: C:\ProgramData\safenode\data\safenode1\safenode.exe
- Data path: C:\ProgramData\safenode\data\safenode1
- Log path: C:\ProgramData\safenode\logs\safenode1
- RPC port: 127.0.0.1:49993
[!] Note: newly added services have not been started
PS C:\Windows\system32> C:\Users\kyte7\safe\safenode-manager start
=================================================
Start Safenode Services
=================================================
Refreshing the node registry...
Attempting to start safenode1...
Failed to start 1 service(s):
✕ safenode1: [SC] StartService FAILED 1053:
The service did not respond to the start or control request in a timely fashion.
Error:
0: Failed to start one or more services
Location:
sn_node_manager\src\cmd\node.rs:421
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
PS C:\Windows\system32> C:\Users\kyte7\safe\safenode-manager status
=================================================
Safenode Services
=================================================
Refreshing the node registry...
Service Name  Peer ID  Status  Connected Peers
safenode1     -        ADDED   -
PS C:\Windows\system32> C:\Users\kyte7\safe\safenode-manager add --help
Add one or more safenode services.
It’s not picking up WinSW and is instead attempting to use sc.exe. The node doesn’t work with that.
I’ll need to look into it in more detail, but in theory, if WinSW is in a location that’s on the PATH, it’s supposed to prefer that over sc.exe. I’ll have a look on Tuesday when I’m back. Thanks for your efforts in the meantime.
Btw, have you confirmed this just by opening a Powershell session and typing “winsw”? In Windows, sometimes you need to log out and back in for a path change to take effect.
We use it in our integration tests and I haven’t seen any of the Windows ones fail.
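A quick way to run the same check from a script is to ask which executables resolve on the PATH of the current process. The sketch below uses Python's standard `shutil.which`; the candidate names in the usage part are only examples, and this says nothing about how the node manager itself looks up WinSW.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_on_path(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Example names only; adjust to whatever your WinSW binary is called.
    for name, path in find_on_path(["winsw", "sc"]).items():
        print(f"{name}: {path or 'not found on PATH'}")
```

Note that this reflects the PATH inherited by the process running the script, so a session opened before the PATH change may still report the binary as missing.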
For certain services (depending on their parent service), it requires a full reboot. Other processes like ‘explorer.exe’, ‘cmd.exe’ etc. will see the new env vars immediately (once you open a new process for them), because Windows broadcasts a notification message internally.
ProcExplorer64.exe (part of the SysInternals Suite) shows this in one of its tabs when you select a pid (in this case powershell.exe, cmd.exe, sc.exe, winsw.exe, safenode-manager.exe etc.): whether that service or pid sees the new PATH value as part of its current PEB, which it inherits from its parent or the OS at spin-up time.
For certain services or existing pids that are already running, making low-level Windows API calls like this will make them recognize the new environment variables immediately, or may require a restart of the service but not a reboot of the computer; it depends on what the parent root service is for that pid (C# code below):
So if ProcExplorer64 isn’t showing the updated PATH as part of the running pid’s env vars, then attempting to refresh its env vars with the above code may work, and if not, then ultimately a reboot would be required.
I cannot open it via PowerShell. I am running as administrator; if I simply enter WinSW-x64.exe nothing happens, and if I run it with Start-Process it pops up briefly then closes.
I can start it via the GUI as admin. Lost for ideas.