We are thrilled to announce the upcoming launch of our new testnet, which builds upon the success of our popular NatNet. This testnet, codenamed ReplicationNet, is set to explore the exciting field of Data Replication and Health Metrics. We’ll be concentrating on analysing network health under a considerable data load.
Our prior data storage and replication system relied on the libp2p::record method. While this method has greatly simplified network discovery and routing, record replication by its nature broadcasts every record to all CLOSE_GROUP_SIZE peers. This broadcasting can trigger a data storm during churn, potentially leading to network congestion as data volume increases.
To overcome these challenges, we have implemented targeted replication on top of the libp2p::record facility. Unlike the previous method, which broadcast to all CLOSE_GROUP_SIZE peers, the updated approach seeks out missing data amongst the CLOSE_GROUP_SIZE peers, but only requests each piece of data once (unless there was an issue). This strategy significantly reduces network traffic, and in local testing it has shown far lower memory and CPU usage while increasing resilience.
The primary objectives of the ReplicationNet testnet are to:
- Verify the replication method functions as intended, minimising data loss.
- Gather network/process health metrics for a more comprehensive understanding of node performance.
- Provide a baseline for further improvements to data replication flows.
Participation guidelines
If you are interested in participating, we kindly request that you:
- Use a client to upload files. (The client can be found here.)
- If possible, share the uploaded files with the community by giving out their name-addresses.
- Periodically download the shared files to verify they’re retained by the network.
- Run a node from a cloud VM (home nodes will likely fail and be shut down). Nodes can also be found here.
- Keep the node running for as long as possible.
- Share the health metric logs of your node.
Connecting to ReplicationNet
To join ReplicationNet with a cloud node, set the network address using the SAFE_PEERS environment variable or use the --peer= argument. You can use any of the following addresses to connect to the network:
export SAFE_PEERS="/ip4/142.93.33.47/tcp/43691/p2p/12D3KooWATuSWjt61DUoqhVmHXpfenMnqMoPmrhnREQkZ25xrDGQ"
# fall-back addresses
/ip4/165.227.231.8/tcp/43103/p2p/12D3KooWByBQokh2D8Y7ATzXVCzKGiN5g4L8GKXBmBoZhAVeW4At
/ip4/165.232.106.150/tcp/41989/p2p/12D3KooWRYaapMU4i4zwNT3zQhcYPrkCd5QoHuLe7Z4PfGs3hQj5
/ip4/165.22.125.99/tcp/43473/p2p/12D3KooWAd3oubsP5yqU3gBnPw27zDqSZu4LW1F1hqkoojXBL7Rd
/ip4/165.22.119.173/tcp/38933/p2p/12D3KooWAvDMcv39DDsNd8kFSEMCc3cx5ye5VjYFhe43RRLw63rz
Initial network
We have 100 droplets running 2001 nodes in total (each droplet has 2 vCPUs and 2 GB of memory). We haven’t run a network this large with the community (at least not for very long), so this will be testing out the testnet tool to some degree too!
Using the Client
To put/get files you’ll need to use the safe client, which you can grab for your platform from GitHub. Once you have the client, you need to either set the SAFE_PEERS environment variable or use the --peer= argument with any of the above network addresses.
Now to upload a directory/file to the network, use the following command:
# using the SAFE_PEERS variable
export SAFE_PEERS=/ip4/142.93.33.47/tcp/43691/p2p/12D3KooWATuSWjt61DUoqhVmHXpfenMnqMoPmrhnREQkZ25xrDGQ
safe files upload -- <path>
# alternatively, use the --peer argument; it must be passed with each command
safe --peer=/ip4/142.93.33.47/tcp/43691/p2p/12D3KooWATuSWjt61DUoqhVmHXpfenMnqMoPmrhnREQkZ25xrDGQ files upload -- <path>
The file addresses of the content you’ve uploaded are saved locally and are used to enable automatic downloads. To download the content you’ve just uploaded back to the ~/.safe/client/downloaded_files folder, run:
$ safe files download
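To confirm the download completed, you can simply list that folder:
$ ls -lh ~/.safe/client/downloaded_files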
Running a Node
You can find the safenode binaries for your platform here.
Connect your node to the network using the SAFE_PEERS environment variable or the --peer argument, just as with the client. Consider keeping your logs in a dedicated directory for convenience:
$ SN_LOG=all safenode --log-dir=/tmp/safenode --root-dir=/tmp/safenodedata
Windows (note that set must run as its own command before launching the node):
set SN_LOG=all
safenode --log-dir=%TEMP%\safenode --root-dir=%TEMP%\safenodedata
Please note that if you are running from home in a NAT environment, the node should automatically shut down after a few minutes. If this occurs, kindly share the peer ID found in the initial log lines.
Error: We have been determined to be behind a NAT. This means we are not reachable externally by other nodes. In the future, the network will implement relays that allow us to still join the network.
It should be possible to run more than one node per cloud VM, depending on its size, CPU and memory (we’ve successfully run 10 on a 1 vCPU / 1 GB droplet). Please note, though, that logs and data will build up and may exceed the storage capacity.
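If you do try multiple nodes, here’s a minimal sketch for launching several instances with separate log and data directories so they don’t overwrite each other. It assumes each safenode instance will pick its own port automatically; adjust the count to suit your VM.
# launch five nodes, each with its own log and data directory
export SAFE_PEERS="/ip4/142.93.33.47/tcp/43691/p2p/12D3KooWATuSWjt61DUoqhVmHXpfenMnqMoPmrhnREQkZ25xrDGQ"
for i in $(seq 1 5); do
  # assumes each instance picks a free port on its own
  SN_LOG=all safenode --log-dir="/tmp/safenode-$i" --root-dir="/tmp/safenodedata-$i" &
done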
Interesting Log Lines
For this testnet run, log lines containing the following keywords are important to us. If anyone running nodes can periodically pull logs and share information around these keywords, that would be amazing.
PeerAdded:
Detected dead peer
Sending a replication list
Replicate list received from
Fetching replication
Replicating chunk
Chunk received for replication
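A simple way to pull these out is to grep your log directory for the keywords. A quick sketch, assuming your logs live under /tmp/safenode (as in the example above) and end in .log:
# collect the replication-related lines into one file for sharing
grep -hE "PeerAdded|Detected dead peer|Sending a replication list|Replicate list received from|Fetching replication|Replicating chunk|Chunk received for replication" /tmp/safenode/*.log > replication_lines.txt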
We’ve also enabled the node and client to regularly log metrics about the system, network, and the running process. These metrics are logged as JSON objects to the usual log file, so they can be parsed and piped to other applications for analysis. Below is a sample log line containing the metrics:
[2023-06-05T12:29:59.680321Z TRACE sn_logging::metrics] {"physical_cpu_threads":12,"system_cpu_usage_percent":11.24604,"system_total_memory_mb":33517.777,"system_memory_used_mb":16400.195,"system_memory_usage_percent":48.929844,"network":{"interface_name":"enp0s31f6","bytes_received":1774,"bytes_transmitted":37947,"total_mb_received":2367.518,"total_mb_transmitted":717.744},"process":{"cpu_usage_percent":1.2671595,"memory_used_mb":35.004417,"bytes_read":0,"bytes_written":8192,"total_mb_read":0.0,"total_mb_written":0.1024}}
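Each metrics line is a JSON object following the sn_logging::metrics] prefix, so you can strip the prefix and pipe the rest into a tool like jq. A rough sketch, assuming logs under /tmp/safenode and jq installed:
# pull the per-process CPU and memory figures out of the metrics lines
grep "sn_logging::metrics" /tmp/safenode/*.log \
  | sed 's/^.*sn_logging::metrics] //' \
  | jq '{cpu_percent: .process.cpu_usage_percent, memory_mb: .process.memory_used_mb}'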
Known problems
We’re aware that some messages are being dropped, and so the replication flow is still imperfect. We’re continuing to dig into this and have some leads. Hopefully we’ll learn more from this testnet too.