Bash scripts for managing safe nodes on Linux

  • Migrate a node to a different machine
    Rocky Linux 9.4, bash
    Not really tested, so just an idea; some additional configuration is needed to make this work.
  PEER_ID="<peer_id>"   # node to migrate, running at port <sn_port> on the source machine
  mkdir -p $HOME/.local/share/safe/node
  # pre-sync the node's data directory twice while the node is still running,
  # so the final sync after stopping it has little left to transfer
  sshpass -p 'password' rsync -avz -e 'ssh -p <port>' --bwlimit=1000 --progress <user>@<host>:/home/<user>/.local/share/safe/node/$PEER_ID $HOME/.local/share/safe/node/
  sshpass -p 'password' rsync -avz -e 'ssh -p <port>' --bwlimit=1000 --progress <user>@<host>:/home/<user>/.local/share/safe/node/$PEER_ID $HOME/.local/share/safe/node/
  # terminate the node on the source machine, then do the final sync
  sshpass -p 'password' ssh -p <port> <user>@<host> '/home/<user>/snnm -p <sn_port> -t'
  sshpass -p 'password' rsync -avz -e 'ssh -p <port>' --bwlimit=1000 --progress <user>@<host>:/home/<user>/.local/share/safe/node/$PEER_ID $HOME/.local/share/safe/node/
  # restart the node locally
  ./snnm -r
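  A quick sanity check after the restart, reusing commands from elsewhere in this thread (assuming snnm -l lists cached node info as in the remote-management examples):
  ps -A | grep safenode   # the migrated node's process should appear
  ./snnm -l               # list cached node info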
  • Remote management via ssh using snnm (v0.1.4)
    Rocky Linux 9.4, bash
    My favorites are remote termination by port range when the CPU seems locked up, and node process stop-and-restart (eXchange).
  sshpass -p '<password>' ssh -p <port> <user>@<host> 'uptime' #get CPU stats
  sshpass -p '<password>' ssh -p <port> <user>@<host> 'ps -A | grep safenode' #get actually running node PIDs
  sshpass -p '<password>' ssh -p <port> <user>@<host> '/home/<user>/snnm -l' # list cached info
  sshpass -p '<password>' ssh -p <port> <user>@<host> '/home/<user>/snnm -p <sn_port> -t' # terminate by port number
  sshpass -p '<password>' ssh -p <port> <user>@<host> '/home/<user>/snnm -a <sn_port_range_start> -b <sn_port_range_end> -d drirmbda_73081 -x' # exchange node processes for a port range, adding my Discord ID


This Python script parses the safenode-manager status output and constantly runs through it, searching for nodes with 0 connections and rebooting them. You can leave it running permanently; once a pass completes, it checks again every hour.

It can probably be improved but… it works lol.

import subprocess
import time
from datetime import datetime, timedelta

# Dictionary to store the last reboot time for each node
last_reboot_time = {}

def get_safenode_status():
    """Runs the `safenode-manager status` command and returns the output as a list of lines."""
    print("Fetching safenode status...")
    result = subprocess.run(['safenode-manager', 'status'], capture_output=True, text=True)
    print("Status fetched.")
    return result.stdout.splitlines()

def parse_status_output(status_lines):
    """Parses the output from `safenode-manager status` and returns a list of nodes with 0 connections."""
    print("Parsing status output...")
    nodes_with_zero_connections = []
    for line in status_lines:
        print(f"Processing line: {line}")
        parts = line.split()
        if len(parts) < 4:
            print(f"Skipping malformed line: {line}")
            continue  # Skip malformed lines
        node_name = parts[0]
        try:
            connections = int(parts[-1])
        except ValueError:
            print(f"Skipping line due to parsing error: {line}")
            continue
        print(f"Node: {node_name}, Connections: {connections}")
        if connections == 0:
            print(f"Node {node_name} has 0 connections.")
            nodes_with_zero_connections.append(node_name)
    print("Finished parsing status output.")
    return nodes_with_zero_connections

def wait_for_stop(node_name):
    """Waits until the node is fully stopped by checking its status repeatedly."""
    print(f"Waiting for {node_name} to fully stop...")
    while True:
        status_lines = get_safenode_status()
        for line in status_lines:
            if node_name in line:
                if "STOPPED" in line:
                    print(f"{node_name} is now stopped.")
                    return
        print(f"{node_name} is not yet stopped, checking again in 5 seconds...")
        time.sleep(5)  # Check every 5 seconds

def reboot_node(node_name):
    """Reboots a node by stopping and starting the service."""
    print(f"Rebooting {node_name}...")
    stop_command = ['safenode-manager', 'stop', '--service-name', node_name]
    start_command = ['safenode-manager', 'start', '--service-name', node_name]
    
    print(f"Stopping {node_name}...")
    subprocess.run(stop_command)
    print(f"Stopped {node_name}, waiting for it to fully stop...")
    wait_for_stop(node_name)
    print(f"Starting {node_name}...")
    subprocess.run(start_command)
    print(f"Started {node_name}")
    # Update the last reboot time for the node
    last_reboot_time[node_name] = datetime.now()
    print(f"Updated last reboot time for {node_name} to {last_reboot_time[node_name]}.")

def should_reboot(node_name):
    """Checks if the node should be rebooted based on the last reboot time."""
    if node_name not in last_reboot_time:
        print(f"{node_name} has not been rebooted before, should reboot.")
        return True
    if datetime.now() - last_reboot_time[node_name] > timedelta(hours=1):
        print(f"More than an hour has passed since {node_name} was last rebooted, should reboot.")
        return True
    print(f"Less than an hour since {node_name} was last rebooted, skipping reboot.")
    return False

def main():
    print("Starting safenode monitoring script...")
    status_lines = get_safenode_status()
    nodes_to_reboot = parse_status_output(status_lines)
    
    for node_name in nodes_to_reboot:
        if should_reboot(node_name):
            reboot_node(node_name)
        else:
            print(f"Skipping reboot for {node_name} as it was recently rebooted.")
    
    print("Sleeping for one hour before next check...")
    time.sleep(3600)  # Sleep for one hour

if __name__ == "__main__":
    while True:
        main()


So even with my script, I seem to only get up to about 12 nodes before others start dropping out. Is this something to do with my router, maybe? I see people discussing this --home-network flag, but what does it actually do? I notice another script specifies a port range. I haven't opened anything on the router; I presumed it was working, given I have nearly 1000 active connections.

How do we know the incoming ports? Are there guidelines for this?

I am using port forwarding of port ranges to different machines. On those machines I make sure those ports are not blocked by the machine's firewall (if any). Then I run nodes at specific ports to keep things organized, ports-wise. I have no experience doing it otherwise.
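For what it's worth, the machine-side firewall part can be scripted. A minimal sketch, assuming firewalld (the Rocky Linux default) and an illustrative UDP range of 30600-30699 (the same range as the traffic-shaping script later in the thread):

sudo firewall-cmd --permanent --add-port=30600-30699/udp   # open the node port range
sudo firewall-cmd --reload
sudo firewall-cmd --list-ports                             # verify the range is open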

My assumption is that I don't need to open ports, as some nodes are connecting… some are always on 0 though… I can't seem to maintain them. My guess is that if it were a port issue, it would always be 0 on all of them. How do I debug this?

They can connect to their peers the same way your browser connects to sites; that's normal firewall operation. So the churn process works and you are given chunks to store due to churn. These are not new chunks you earn from, just existing ones you became responsible for.

The problem is the other way around: clients and other nodes cannot contact you, because your firewall is blocking unknown incoming packets. When you talk to a peer, the router's NAT remembers that, and when a response comes back the router says to that packet "you were expected, I will route you to the PC that contacted you." But an unknown incoming packet hits a wall and goes splat.
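One rough way to debug this: test whether an unsolicited packet from outside actually reaches a node port at all. A sketch, assuming OpenBSD-style netcat and an example port of 30600 (stop the node first so the port is free):

nc -u -l 30600                     # on the node machine: listen on the node's UDP port
echo ping | nc -u <wan_ip> 30600   # from OUTSIDE your LAN: send a test datagram to your WAN IP
# if "ping" shows up on the listener, the port-forward/firewall path works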

Ah gotcha, so I'll open the ports. How can I modify the ports the already-created nodes use?

If you used --node-port, then opening those ports should allow them to receive unsolicited packets from clients and other nodes. But the node may be shunned unknowingly by too many nodes; the logs only record a shun if the node receives the message that it is shunned.

As always with beta, things like this should see a reset first and a fresh start. Unless that is what you are trying to test.
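If you do reset, recreating the nodes on known ports could look something like this. A sketch only: I am assuming safenode-manager has a reset subcommand, and the add flags are the ones used later in this thread:

safenode-manager reset   # wipe the existing node services (assumption)
for port in $(seq 30600 30611); do
  safenode-manager add --owner <owner> --node-port "$port" --peer <peer_multiaddr>
done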

  • Moderate traffic peaking from your nodes on your Linux box
    Rocky Linux 9.4, bash
    It appears that shielding my router and ISP from peaks in upload bandwidth helps keep latency stable and low, and helps lower router CPU load somewhat and keep it more stable. The code below, inspired by wondershaper, is just an example that works somehow but can probably be improved. Will tuning of safe node traffic shaping become a thing?
IFACE="enp1s0"          # interface to apply this to
# root HTB qdisc; note "default 20" names a class that is never defined below,
# so traffic not matched by any filter is left unshaped
sudo tc qdisc replace dev "$IFACE" root handle 1: htb default 20
USPEED="200000"         # 200000kbit for a 200Mbps ISP upload service
sudo tc class replace dev "$IFACE" parent 1: classid 1:1 htb rate "${USPEED}kbit" prio 5
RATE=$((20*${USPEED}/100))   # guarantee the node class 20% of the uplink
if [ "$RATE" -eq 0 ]; then RATE=1 ; fi
sudo tc class add dev "$IFACE" parent 1:1 classid 1:30 htb rate "${RATE}kbit" ceil $((90*${USPEED}/100))kbit prio 3
# source-port range (or replace with a list of ports) whose outbound traffic is capped via class 1:30
NOPRIOPORTSRC=($(seq 30600 1 30699))
for sport in "${NOPRIOPORTSRC[@]}"; do
  [ -n "$sport" ] || continue
  sudo tc filter add dev "$IFACE" parent 1: protocol ip prio 15 u32 match ip sport "$sport" 0xffff flowid 1:30;
done
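To check whether the filters actually match anything, and to undo the shaping, standard tc commands work:

tc -s class show dev "$IFACE"         # per-class counters; bytes on class 1:30 should grow
tc -s filter show dev "$IFACE" parent 1:
sudo tc qdisc del dev "$IFACE" root   # removes the whole shaping setup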

Yeaaah, I didn't do that on creation. Guess I'll start again with my open ranges lol.


So I wiped all the nodes and ran something like:

safenode-manager add --owner .user --node-port $number --peer /ip4/46.101.80.187/udp/58070/quic-v1/p2p/12D3KooWKgJ>

safenode  320788          maxsan 1768u  IPv4 11245091      0t0  TCP 127.0.0.1:41567 (LISTEN)
safenode  320888          maxsan 1897u  IPv4 11253896      0t0  TCP 127.0.0.1:41711 (LISTEN)

These nodes, though, are not on the correct ports… what am I missing here? The ports still seem random; they don't match the loop value at all.
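One thing worth checking (this is my assumption): the lsof lines above show TCP listeners on 127.0.0.1, which may be the RPC/metrics side, while --node-port sets the node's UDP/QUIC listener (note the /udp/…/quic-v1 in the peer multiaddr). A quick look at the UDP side:

sudo ss -ulpn | grep safenode   # UDP listeners; these should match your --node-port values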

First thought is that I wish vdash had a non-interactive mode, just analyzing the logs and spitting out findings. (@happybeing)
Second, I have not been successful in accessing the RPC interface from the CLI to query node status directly.
Third, I think the nodes should probably report node health metrics without necessarily needing another app.

To keep people motivated to run nodes, it needs to be clear if they are even contributing to the network, at least, and longer term, if the node gets rewarded properly for the resources it is taking.

Final point: As soon as someone runs a node on a "spare" resource, it reduces the spare resources left, which certainly is a cost to the owner of those resources. Spare resources have value even if they remain unused. This changes only if a node can get out of the way when the spare resource becomes needed for something else. Currently, that requires constant monitoring; in the future the node would need to sense that it is causing trouble, which would require sensing disk utilization, CPU utilization, router latency, bandwidth, and router CPU, and then quickly moderating activity, shaping bandwidth, or just shutting down.
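As a crude illustration of that "get out of the way" idea, a bash sketch; the thresholds, the data path, and the blanket safenode-manager stop are all assumptions:

#!/bin/bash
DISK_LIMIT=90        # % usage on the node data filesystem
LOAD_LIMIT=$(nproc)  # 1-minute load average threshold
while true; do
  disk=$(df --output=pcent "$HOME/.local/share/safe/node" | tail -1 | tr -dc '0-9')
  load=$(cut -d' ' -f1 /proc/loadavg)
  if [ "$disk" -ge "$DISK_LIMIT" ] || [ "$(echo "$load >= $LOAD_LIMIT" | bc)" -eq 1 ]; then
    echo "Resources tight (disk ${disk}%, load ${load}); stopping nodes"
    safenode-manager stop   # assumption: with no args this stops all node services
  fi
  sleep 60
done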


I successfully used grpcurl (on GitHub) to talk to the RPC interface from the CLI.


I tried and got stuck. I would appreciate it if someone (or you) could give one working example using grpcurl.


You need the proto files. Search that out and follow the discussions from when I was trying to access the RPC calls. The proto files are in josheuf’s GitHub, I think.

I haven’t got time at the moment to go back and look or work it out again. Maybe later on. I didn’t get around to writing scripts for it, unfortunately. But I know you do not need to compile the proto files.


I got the two proto files already, but got stuck there. Anyways, if you have time, please do help.


I went back through my CLI history on the terminal, and I am retyping this as it’s on another machine, so watch out for typos.

grpcurl -plaintext -proto safenode.proto 127.0.0.1:38541 safenode_proto.SafeNode/KBuckets

I hope that’s right. I did have to do a couple of tries and this is the last one, so it should be good. Change the RPC port to one of yours and it gives you the KBuckets.

EDIT: Just tried it on a node and its RPC port (use safenode-manager status --details to get the RPC port numbers) and it gave me the KBuckets.
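For exploring the rest of the interface, grpcurl’s standard list and describe verbs should work with the same proto file (untested by me, so treat as a sketch):

grpcurl -plaintext -proto safenode.proto 127.0.0.1:38541 list
grpcurl -plaintext -proto safenode.proto 127.0.0.1:38541 describe safenode_proto.SafeNode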


Thanks so much! I will try it as soon as I get back to my terminal.