Migrate a node to a different machine
Rocky Linux 9.4, bash
Not really tested, so just an idea. Some additional configuration is needed to make this work.
Remote management via ssh using snnm (v0.1.4)
Rocky Linux 9.4, bash
My favorites are remote termination by port range when the CPU seems locked up, and node process stop and restart (eXchange).
sshpass -p '<password>' ssh -p <port> <user>@<host> 'uptime' # get CPU stats
sshpass -p '<password>' ssh -p <port> <user>@<host> 'ps -A | grep safenode' # get the actually running node PIDs
sshpass -p '<password>' ssh -p <port> <user>@<host> '/home/<user>/snnm -l' # list cached info
sshpass -p '<password>' ssh -p <port> <user>@<host> '/home/<user>/snnm -p <sn_port> -t' # terminate by port number
sshpass -p '<password>' ssh -p <port> <user>@<host> '/home/<user>/snnm -a <sn_port_range_start> -b <sn_port_range_end> -d drirmbda_73081 -x' # exchange node processes for a port range, adding my Discord ID
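For checking several boxes in one go, a small wrapper loop helps. This is just a sketch, assuming a hypothetical hosts.txt that lists one box per line as "user@host port password", and that snnm sits in each remote user's home directory:

#!/usr/bin/env bash
# Rough sketch: run the same checks against every box listed in hosts.txt
# (assumed format: user@host port password, one box per line).
while read -r target port pass; do
    [ -n "$target" ] || continue
    echo "=== $target ==="
    sshpass -p "$pass" ssh -p "$port" "$target" 'uptime'                        # CPU / load
    sshpass -p "$pass" ssh -p "$port" "$target" 'ps -A | grep safenode | wc -l' # count running nodes
    sshpass -p "$pass" ssh -p "$port" "$target" '~/snnm -l'                     # list cached snnm info
done < hosts.txt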
This Python script parses the output of the safenode-manager status command and continuously searches for nodes with 0 connections, then reboots them. You can leave it running permanently; once a pass completes, it checks again every hour.
It can probably be improved but… it works lol.
import subprocess
import time
from datetime import datetime, timedelta

# Dictionary to store the last reboot time for each node
last_reboot_time = {}

def get_safenode_status():
    """Runs the `safenode-manager status` command and returns the output as a list of lines."""
    print("Fetching safenode status...")
    result = subprocess.run(['safenode-manager', 'status'], capture_output=True, text=True)
    print("Status fetched.")
    return result.stdout.splitlines()

def parse_status_output(status_lines):
    """Parses the output from `safenode-manager status` and returns a list of nodes with 0 connections."""
    print("Parsing status output...")
    nodes_with_zero_connections = []
    for line in status_lines:
        print(f"Processing line: {line}")
        parts = line.split()
        if len(parts) < 4:
            print(f"Skipping malformed line: {line}")
            continue  # Skip malformed lines
        node_name = parts[0]
        try:
            connections = int(parts[-1])
        except ValueError:
            print(f"Skipping line due to parsing error: {line}")
            continue
        print(f"Node: {node_name}, Connections: {connections}")
        if connections == 0:
            print(f"Node {node_name} has 0 connections.")
            nodes_with_zero_connections.append(node_name)
    print("Finished parsing status output.")
    return nodes_with_zero_connections

def wait_for_stop(node_name):
    """Waits until the node is fully stopped by checking its status repeatedly."""
    print(f"Waiting for {node_name} to fully stop...")
    while True:
        status_lines = get_safenode_status()
        for line in status_lines:
            if node_name in line:
                if "STOPPED" in line:
                    print(f"{node_name} is now stopped.")
                    return
        print(f"{node_name} is not yet stopped, checking again in 5 seconds...")
        time.sleep(5)  # Check every 5 seconds

def reboot_node(node_name):
    """Reboots a node by stopping and starting the service."""
    print(f"Rebooting {node_name}...")
    stop_command = ['safenode-manager', 'stop', '--service-name', node_name]
    start_command = ['safenode-manager', 'start', '--service-name', node_name]
    print(f"Stopping {node_name}...")
    subprocess.run(stop_command)
    print(f"Stopped {node_name}, waiting for it to fully stop...")
    wait_for_stop(node_name)
    print(f"Starting {node_name}...")
    subprocess.run(start_command)
    print(f"Started {node_name}")
    # Update the last reboot time for the node
    last_reboot_time[node_name] = datetime.now()
    print(f"Updated last reboot time for {node_name} to {last_reboot_time[node_name]}.")

def should_reboot(node_name):
    """Checks if the node should be rebooted based on the last reboot time."""
    if node_name not in last_reboot_time:
        print(f"{node_name} has not been rebooted before, should reboot.")
        return True
    if datetime.now() - last_reboot_time[node_name] > timedelta(hours=1):
        print(f"More than an hour has passed since {node_name} was last rebooted, should reboot.")
        return True
    print(f"Less than an hour since {node_name} was last rebooted, skipping reboot.")
    return False

def main():
    print("Starting safenode monitoring script...")
    status_lines = get_safenode_status()
    nodes_to_reboot = parse_status_output(status_lines)
    for node_name in nodes_to_reboot:
        if should_reboot(node_name):
            reboot_node(node_name)
        else:
            print(f"Skipping reboot for {node_name} as it was recently rebooted.")
    print("Sleeping for one hour before next check...")
    time.sleep(3600)  # Sleep for one hour

if __name__ == "__main__":
    while True:
        main()
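To keep the monitor running after you log out, one simple option is nohup; a quick sketch, where the script name zero_conn_monitor.py is just a placeholder:

nohup python3 zero_conn_monitor.py >> zero_conn_monitor.log 2>&1 &
tail -f zero_conn_monitor.log   # watch what it is doing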
So even with my script, I seem to only get up to about 12 nodes before others start dropping out. Is this something to do with my router maybe? I see people discussing this --home-network flag, but what does it actually do? I notice another script that specifies a port range. I haven't opened ports on the router; I presumed it was working given I have nearly 1000 active connections.
I am using port forwarding of port ranges to different machines. On those machines I make sure those ports are not blocked by the machine's firewall (if any). Then I run nodes at specific ports to keep things organized, ports-wise. I have no experience doing it any other way.
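On Rocky Linux 9 the host firewall is firewalld, so opening a forwarded range on each machine can look like the sketch below; the 30600-30699 range is only an example, and whether you need UDP, TCP or both depends on how your nodes are run:

sudo firewall-cmd --permanent --add-port=30600-30699/udp
sudo firewall-cmd --permanent --add-port=30600-30699/tcp
sudo firewall-cmd --reload
sudo firewall-cmd --list-ports   # verify the range is open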
My assumption is that I don't need to open ports, as some nodes are connecting… but some are always on 0, and I can't seem to maintain them. My guess is that if it were a port issue, it would always be 0 on all of them. I'm unsure how to debug this.
They can connect to their peers the same way your browser connects to sites: normal firewall operation. So the churn process works and you are given chunks to store due to churn. These are not new chunks you earn from, just existing ones you became responsible for.
The problem is the other way around: clients and other nodes cannot contact you because your firewall is blocking unknown incoming packets. When you talk to a peer, the router's NAT remembers that, and when a response comes back the router says to that packet, "you were expected, I will route you to the PC that contacted the peer." But an unknown incoming packet hits a wall and goes splat.
If you used --node-port, then opening those ports should allow the nodes to receive unsolicited packets from clients and other nodes. But a node may be shunned unknowingly by too many nodes; the logs only record a shun if the node receives the message that it is shunned.
As always with beta, things like this should see a reset first and start over. Unless that is what you are trying to test.
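For basic debugging of the inbound side, standard tools show what the nodes are bound to and whether a forwarded port answers from outside; a sketch (the port number and <public-ip> are placeholders):

# On the node machine: which ports are the safenode processes listening on?
ss -tulpn | grep safenode

# From a machine outside your LAN: does a forwarded TCP port answer?
nc -vz <public-ip> 30600

# UDP is harder to probe reliably; at minimum confirm the router forwards the
# range and that the host firewall is not dropping it.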
Moderate traffic peaking from your nodes on your Linux box
Rocky Linux 9.4, bash
It appears that shielding my router and ISP from peaks in upload bandwidth helps keep latency stable and low, and helps lower router CPU load somewhat and keep it more stable. The code below, inspired by wondershaper, is just an example that works somehow but can probably be improved. Will tuning of safe node traffic shaping become a thing?
IFACE="enp1s0" # interface to apply this to
sudo tc qdisc replace dev "$IFACE" root handle 1: htb default 20
USPEED="200000" #200000kbit for 200Mbps ISP upload service
sudo tc class replace dev "$IFACE" parent 1: classid 1:1 htb rate "${USPEED}kbit" prio 5
RATE=$((20*${USPEED}/100))
if [ "$RATE" -eq 0 ]; then RATE=1 ; fi
sudo tc class add dev "$IFACE" parent 1:1 classid 1:30 htb rate "${RATE}kbit" ceil $((90*${USPEED}/100))kbit prio 3
# source port range (or replace with a list of ports) whose upload bandwidth is capped via class 1:30
NOPRIOPORTSRC=($(seq 30600 1 30699))
for sport in "${NOPRIOPORTSRC[@]}"; do
    [ -n "$sport" ] || continue
    sudo tc filter add dev "$IFACE" parent 1: protocol ip prio 15 u32 match ip sport "$sport" 0xffff flowid 1:30
done
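To check that the shaping class is actually catching the node traffic, and to remove it again, the usual tc show/del commands can be used:

tc -s class show dev "$IFACE"        # per-class byte/packet counters; 1:30 should be growing
tc -s filter show dev "$IFACE"       # confirm the sport filters are installed
sudo tc qdisc del dev "$IFACE" root  # remove all shaping and fall back to defaults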
First thought is that I wish vdash had a non-interactive mode, just analyzing the logs and spitting out findings. (@happybeing)
Second, I have not been successful accessing the RPC interface from the cli, to query node status directly.
Third, I think that the nodes should probably report node health metrics without necessarily needing another app.
To keep people motivated to run nodes, it needs to be clear whether they are even contributing to the network at all and, longer term, whether the node gets rewarded properly for the resources it is taking.
Final point: As soon as someone runs a node on a "spare" resource, it reduces the spare resources left, which certainly is a cost to the owner of those resources. Spare resources have value even if they remain unused. This changes only if a node can get out of the way when the spare resource becomes needed for something else. Currently, that requires constant monitoring; in the future the node would need to sense that it is causing trouble, which would require sensing disk utilization, CPU utilization, router latency, bandwidth, and router CPU, and then quickly moderating activity, shaping bandwidth, or just shutting down.
You need the proto files. Search that out and follow the discussions on it from when I was trying to access the RPC calls. The proto files are in joshuef's GitHub, I think.
I haven't got time at the moment to go back and look or work it out again. Maybe later on. I didn't get around to writing scripts for it, unfortunately. But I know you do not need to compile the proto files.
I hope that's right; I did have to do a couple of tries and this is the last one, so it should be good. Change the RPC port to one of yours and it gives you the KBuckets.
EDIT: Just tried it on a node and its RPC port (use safenode-manager status --details to get the RPC port numbers) and it gave me the KBuckets.
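One way to poke a gRPC endpoint from the CLI without compiling the proto files is grpcurl; a rough sketch, assuming the .proto files have been downloaded into ./protos (the file name safenode.proto and the directory are only placeholders):

# List the services defined in the proto files
grpcurl -plaintext -import-path ./protos -proto safenode.proto localhost:<rpc-port> list

# Show the RPC methods each service exposes
grpcurl -plaintext -import-path ./protos -proto safenode.proto localhost:<rpc-port> describe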