NoClientGossipDiscovery [24/11/23 Testnet] [Offline]

A little late to the party, but I'm working on getting a node fired up, :smiley: .

I am blanking out here (I forgot what I did last testnet), but for those starting safenode manually with args and compiling it from source with other features, what is at least one peer address (PEER_ID) that should be set as an env var or passed to the safenode process?

I didn’t see one mentioned in the OP.

Is there a static URL containing the network PEER IDs published by MaidSafe’s nodes that can be used to bootstrap the node, i.e. the location of the network_contacts file (if that still exists)?

1 Like

I’m almost sure they are here:
https://sn-testnet.s3.eu-west-2.amazonaws.com/network-contacts
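
For a manual start, something along these lines should work to use it (just a sketch, assuming the file lists one multiaddr per line and that you bootstrap via SAFE_PEERS):

# rough sketch: pull the contacts file and export one entry as the bootstrap peer
curl -s https://sn-testnet.s3.eu-west-2.amazonaws.com/network-contacts -o network-contacts
export SAFE_PEERS="$(head -n 1 network-contacts)"
# then start safenode as usual in the same shell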

2 Likes

Excellent! I knew it existed but wasn’t able to find the URL when searching in the forums.

It would be nice if this handy static URL were also part of the OP template/instructions for future testnets.

2 Likes

My latency is typically around 9 ms. It stayed there for the first part of this test, but since the memory started climbing, so has the latency.

The number of nodes you run is going to affect this, so it's a nice indicator.
Current Latency to 8.8.8.8: 52.308 ms

Command for average latency:
ping -c 4 8.8.8.8 | tail -1| awk '{print $4}' | cut -d '/' -f 2
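
If you want to watch it over time rather than spot-check, a quick loop like this (just a sketch) appends a timestamped average reading every minute:

# log the average ping to 8.8.8.8 once a minute
while true; do
  echo "$(date -u +%FT%TZ) $(ping -c 4 8.8.8.8 | tail -1 | awk '{print $4}' | cut -d '/' -f 2) ms" >> latency.log
  sleep 60
done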

2 Likes

Most likely, latency increased because of an increase in network usage and not because of RAM.
But RAM and network usage may increase because of some common event, yeah.

Pings are processed at some low level in the operating system (the kernel or thereabouts), and such processing does not require much RAM.

1 Like

Yes, I was not saying it was due to RAM, just that it started at around the same time. :+1:

1 Like
$ safe -V
sn_cli 0.86.20

:+1:

2 Likes

The node has been up and running for roughly the past 1.5 hours:

Features added:

  • Transfer rate of Gossipsub messages being received, taken from the metrics endpoint.
  • Fixed the network traffic rate (correct units) on the derivative function.

Note: the areas showing ‘No Data’, particularly under ‘PUT Request’, ‘Chunk Written’, and ‘Chunk Deleted’, are likely due to the on-disk log format having changed, and I would need to circle back and re-update the parser logic on my end (TBD). In addition, ‘Unique Peer IDs’, ‘Unique Request IDs’, and ‘Unique Chunk Addresses’ also broke in my parser back-end due to the log format change (TBD).

However, ‘PUT Records OK’, ‘SN Networking Records Stored’, ‘SN Replication Triggered’, etc. are all coming from the metrics endpoint.
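
(For anyone wanting to poke at the same numbers: these come straight off the node's metrics endpoint, so a plain scrape is enough. A sketch only; the port is whatever your node reported at startup, and the exact metric names may differ from what I grep for here:)

# scrape the node's metrics endpoint and pull out gossipsub-related counters
# replace <PORT> with the metrics port from your node's startup log
curl -s http://localhost:<PORT>/metrics | grep -i gossipsub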

A few observations:

  • Every time there is a large burst of chunks being stored, the number of peers connected remains higher than the previous baseline and does not seem to decrease.

  • In the case above, the network traffic also nearly doubles (linear scaling?) and remains at the higher level, likely due to more peers being actively connected to the node.

  • The libp2p identify messages sent and received periodically also increase (new maximum) once more chunks are stored, including a rise in their average baseline.

  • Gossipsub messages received in this case seem to remain steady and flat (reducing in overall maximum), but the base minimum rate seems to be increasing from the initial starting point.

  • Not sure why libp2p Observed Addresses & Connected Nodes show no data from the metrics endpoint (TBD).

Zooming in on the above observations here:

Further observations after initial post:

  • I can’t tell if the chunks that arrived are purely due to replication or to new data being stored, as ‘SN Replication Triggered’ shows 0 and the ‘SN Wallet Balance’ also shows 0. I am guessing this is existing replicated data, that the ‘SN Wallet Balance’ is accurate (at 0.0), and that ‘SN Replication Triggered’ is likely from the perspective of my node, not replication received that was triggered by other nodes? Hmmm.

image

  • Is the memory rise due to the number of peers that increases after each burst of chunks stored in waves, or a combination of extra peers actively connected plus a potential list of chunk addresses stored on disk being kept in memory?
11 Likes

Is something wrong? I had a node running but, due to some spectacular idiocy while trying to start some on another machine, I logged onto the wrong one and killed it. Now I can’t get a single one running on anything in AWS or at home. I’m getting this in the logs:-

[2023-11-24T22:18:11.820741Z TRACE sn_node::replication] Not having enough peers to start replication: 1/20

I didn’t change anything in between.

There are a lot of ‘Failed to dial’ messages near the start of the log.

I’m using the correct version and it downloaded the list of peers.

EDIT
I see this in the logs:-

[2023-11-24T21:53:11.820059Z INFO sn_node::node] Node has been subscribed to gossipsub topic 'TRANSFER_NOTIFICATION' to receive network royalties payments notifications.

So that sounds good but it’s the only indication I’m connected to anything.

2 Likes

I had this issue too when I grabbed the 1st PEER_ID from the network-contacts URL noted earlier above by Vort for a manual kick-start of the safenode process.

I ended up setting export SAFE_PEERS= in the shell to the 2nd peer in the list above, and that got me bootstrapped properly.

safe-build-122:# echo $SAFE_PEERS
/ip4/46.101.2.43/tcp/35669/p2p/12D3KooWBeQ6RLvhsprzvG6xFuvJQc9FmjXd89toqPU1K4mXxMaA
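
If you have the network-contacts file saved locally, you can pick an entry by line number instead of copy-pasting (a sketch, assuming one multiaddr per line):

# use the 2nd entry from the downloaded contacts file as the bootstrap peer
export SAFE_PEERS="$(sed -n '2p' network-contacts)"
echo $SAFE_PEERS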
1 Like

I am seeing something similar.
I have just killed them all as the cloud instance was asking for a reboot.

I’ll start from scratch again in a min and report back.

Thanks! I’ve tried a random few of them though and got the same result.

Can anyone supply a working peer they’ve started?

You can get it from the first safenode.log with:-

grep 'Local node is listening' safenode.log

and also supplying your external IP address if you are comfortable doing that.

Less to worry about if the node is cloud based.
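
If you do want to share a dialable address, something like this pulls the multiaddr out of the log and swaps in your external IP (a sketch; it assumes the log line contains a full /ip4/<ip>/tcp/<port>/p2p/<peer-id> multiaddr, so adjust the pattern if yours looks different):

# take the first listening multiaddr from the log and substitute the external IP
EXT_IP="$(curl -s https://api.ipify.org)"   # any "what is my IP" service will do
grep -o '/ip4/[0-9.]*/tcp/[0-9]*/p2p/[A-Za-z0-9]*' safenode.log | head -1 \
  | sed "s|/ip4/[0-9.]*|/ip4/${EXT_IP}|"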

1 Like

An up-to-date URL should always be in the code here:

3 Likes

Nice! I fully agree with you :smiley: .

I was searching for network_contacts and not network-contacts, whoops!

I should have tried more permutations, but I have added the URL to my scripts as a side comment so I don’t forget the format of the filename in the URL.

On a side note, I don’t think GitHub used to require an account just to search a repo before… sigh, :frowning: :

image

1 Like

Restarted my cloud instance with an initial single node.

A lot of errors similar to

ERROR sn_node::node] Failed to dial /ip4/68.183.33.184/tcp/46521/p2p/12D3KooWA9VCF5s8SiaDjeSRST7D9yT6nr5aJpseb1xowWPRy2B5: DialError(DialPeerConditionFalse(NotDialing))

also

ERROR sn_networking::event] OutgoingConnectionError to PeerId("12D3KooWDdbpuGWetCZnG1EPz1YspTnjkX21hE35XXJL1ZrKbPDr") on ConnectionId(1) - Transport([("/ip4/138.68.159.23/tcp/45669/p2p/12D3KooWDdbpuGWetCZnG1EPz1YspTnjkX21hE35XXJL1ZrKbPDr", Other(Custom { kind: Other, error: Custom { kind: Other, error: Left(Left(Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })) } }))])

and

TRACE sn_node::replication] Not having enough peers to start replication: 1/20
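
A quick way to gauge how widespread the dial failures are (a sketch; the greps are based on the log lines above, so adjust if your wording differs):

# how many outgoing connection errors, and how many distinct peers we failed to dial
grep -c 'OutgoingConnectionError' safenode.log
grep -o 'Failed to dial /ip4/[^:]*' safenode.log | sort -u | wc -l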

Memory usage has gone up a bit:

PID: 145576
Memory used: 685.562MB
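
(In case anyone wants to compare like-for-like, here is a rough way to pull the same figures for a given PID; not necessarily how the numbers above were produced:)

# RSS in MB and CPU% for a single safenode PID (Linux procps ps assumed)
ps -o pid,rss,%cpu -p 145576 | awk 'NR==2 {printf "PID: %s\nMemory used: %.3fMB\nCPU: %s%%\n", $1, $2/1024, $3}'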

So we have most likely lost nodes.

Same here. No connection.

Btw, has anyone tried the Registers stuff? Is it all functioning?

I don’t know why, but CPU usage went quite high compared to the start of the network. The order of collapse I see is this:

  1. Node (or whole machine) CPU usage jumps up and starts hitting 100%. (This causes the higher latency, I think.)
  2. Memory usage starts going up (probably because of a work backlog). The problem is that it stays up even when there is enough CPU power available again.
  3. Nodes start dying because the machine runs out of memory.

I see it only on low-power machines; a server with an average CPU is fine so far.
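
To check that ordering on a given box, something like this could log per-node CPU and memory once a minute (just a sketch; assumes Linux procps ps and that the processes are named safenode):

# append one line per safenode process per minute, so the CPU spike vs. memory climb can be compared later
while true; do
  ps -C safenode -o pid=,%cpu=,rss= | while read -r pid cpu rss; do
    echo "$(date -u +%FT%TZ) pid=$pid cpu=${cpu}% rss_kb=$rss" >> node-usage.log
  done
  sleep 60
done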

How much CPU does that node use?

CPU usage: 11.6%