We haven’t dug into this yet, but I think we’ll soon be at a point where we can. Good to have folks keeping an eye on this!
Probably should not be logged as an error at the node.
Hmmm, it should have shut down if it received the message you noted?
Sadly, nothing around NAT detection is completely deterministic. There’s a confidence level involved, and if the node gets lucky with its probes it mayyy think it’s in the clear, when it is in fact not.
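To illustrate the idea (a toy model only, not the actual libp2p AutoNAT implementation; the names and thresholds are my own assumptions): the node tracks a status plus a confidence counter, and a short run of successful dial-back probes can leave it convinced it is publicly reachable even when it isn’t.

```rust
// Toy model of confidence-based NAT detection (illustrative only; not the
// actual libp2p AutoNAT code -- names and thresholds are assumptions).
#[derive(Debug, PartialEq)]
enum NatStatus {
    Unknown,
    Public,
    Private,
}

struct NatDetector {
    status: NatStatus,
    confidence: u32,
    // How many consecutive confirmations we require before trusting a flip.
    max_confidence: u32,
}

impl NatDetector {
    fn new(max_confidence: u32) -> Self {
        Self { status: NatStatus::Unknown, confidence: 0, max_confidence }
    }

    /// Feed in one dial-back probe result: `true` if a remote peer managed
    /// to dial us back, `false` otherwise.
    fn on_probe(&mut self, dialed_back: bool) {
        let observed = if dialed_back { NatStatus::Public } else { NatStatus::Private };
        if observed == self.status {
            // Same verdict as before: grow confidence up to the cap.
            self.confidence = (self.confidence + 1).min(self.max_confidence);
        } else if self.confidence > 0 {
            // Conflicting evidence: lose confidence before flipping status.
            self.confidence -= 1;
        } else {
            // Confidence exhausted: flip to the newly observed status.
            self.status = observed;
        }
    }
}

fn main() {
    let mut det = NatDetector::new(3);
    // A few lucky dial-backs leave the node believing it is public, and a
    // single later failure is not enough to change its mind.
    for result in [true, true, true, false] {
        det.on_probe(result);
        println!("probe={result} -> {:?} (confidence {})", det.status, det.confidence);
    }
}
```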
I’m curious how much turnover of nodes we have had. Is there any way to know if and when nodes have left the network? I’m especially curious whether there have been events where a large number of nodes dropped in a short time period.
There are some metrics crates libp2p has that we’ll dive into, which might tell us more.
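For anyone who wants to dig in ahead of that, here is a rough sketch of how rust-libp2p’s metrics support is typically wired up (module paths, feature flags, and signatures vary between libp2p versions, so treat the details as assumptions rather than exact API):

```rust
// Rough sketch: record libp2p events into a Prometheus registry via the
// libp2p-metrics crate (exact paths/features depend on the libp2p version).
use libp2p::metrics::{Metrics, Recorder};
use prometheus_client::registry::Registry;

fn setup_metrics() -> (Registry, Metrics) {
    // The registry is what you would later expose on a /metrics endpoint
    // (via prometheus-client's text encoder) and scrape into Grafana.
    let mut registry = Registry::default();
    let metrics = Metrics::new(&mut registry);
    (registry, metrics)
}

// In the swarm's event loop, each event is handed to the recorder, e.g.:
//
//     while let Some(event) = swarm.next().await {
//         metrics.record(&event); // kad, identify, ping, connection stats, ...
//         handle(event);
//     }
```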
I can say that just now the maidsafe nodes seem stable. We will be looking to add more churn to upcoming testnets, though.
Also: happy day!
In other news, there’s a chance I’ll be bringing this down today. The network has served its purpose well, so thanks so far to everyone who’s got stuck in!
That is not normal. I am also running nodes behind double NAT with port forwarding and I am not experiencing this. I have 110 nodes running from the start of this testnet, and according to my router the number of TCP connections is oscillating around 50-60k, with almost all of them in the “established” state.
My guess would be some kind of router issue, for example it ran out of memory and lost most of its connections, and the one node had enough surviving connections not to trigger the auto-shutdown.
That’s me bringing this down now. Thanks all for getting involved and poking this testnet! Very positive outcome for what we were checking here (and other facets too!)
One of my nodes mysteriously kicked the bucket 30 minutes after your post about bringing this down, so I guess you killed your nodes and it had a knock-on effect, but I don’t see a spike in activity before its sudden demise.
Great work guys, the end seems near yet the real journey is only about to begin!
I am summarizing the timeline for now since I will take down my node soon too. Even though I started much later on this testnet, I am glad to have participated with everyone else here!
There were 4 distinct spikes for me in the last few hours. The 1st and 3rd spike batches used up 75% of CPU on a 4-core machine, while the 2nd and 4th spikes went up to about 25% each.
Note: On the 1st & 3rd spike groups, the kBucket stats (buckets & peers) trended downward, and a ton of CPU was consumed during that downward trend, whether due to more libp2p messages or to actively re-updating the kBucket data structures (the Maidsafe team would know more here). However, the raw count of logged messages parsed dropped dramatically during this period (specifically the Peer Info Sent/Received messages).
Note: Maybe I am not picking up on a certain type of activity here, but that might be happening at the lowest layers and potentially not being logged intentionally (i.e. logging it could result in excessive messages or a slowdown).
Note: On the 2nd and 4th spikes, the kBucket (buckets & peers) stats remained flat, but a ton of ‘GET’ requests were logged, which were the main contributor to the overall raw # of logged messages parsed in the panel above, and likely contributed to the two 25% CPU spike batches.
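For anyone curious, the tallying described above boils down to something like the following sketch; the matched substrings are placeholders, not the exact safenode log strings, so adjust them to the real log format:

```rust
// Minimal sketch of per-message-type log tallying: count how many log lines
// mention each message type of interest. Pipe a node log through stdin, e.g.
//     cat safenode.log | ./log_tally
use std::collections::BTreeMap;
use std::io::{self, BufRead};

fn main() {
    // Substrings to bucket log lines by (placeholders, not exact log strings).
    let categories = ["Peer Info Sent", "Peer Info Received", "GET", "ERROR"];
    let mut counts: BTreeMap<&str, u64> = categories.iter().map(|c| (*c, 0)).collect();

    for line in io::stdin().lock().lines() {
        let line = match line {
            Ok(l) => l,
            Err(_) => continue,
        };
        // A line may match more than one category; that is fine for a rough tally.
        for cat in &categories {
            if line.contains(cat) {
                *counts.get_mut(cat).unwrap() += 1;
            }
        }
    }

    for (cat, n) in &counts {
        println!("{cat}: {n}");
    }
}
```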
Unique Peer IDs went up to 9700+ from 2965+ within 24hrs? Wow.
Note: I don’t think I am double counting here, but will double check.
FWIW, the kBucket stats dropped to 7 buckets with 54 peers, compared to 10 buckets and 154 peers at steady state. I am not sure what the ideal split between the bucket and peer counts should be, but all seems well as the node continues to perform well.
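For context on what that bucket/peer split means, here is an illustrative sketch of the standard Kademlia bucket layout (shortened u32 IDs and an assumed K = 20 per bucket, not the actual libp2p code):

```rust
// Illustrative Kademlia-style k-bucket indexing (toy IDs, not libp2p's code).
// A peer's bucket is determined by the most significant bit at which its ID
// differs from ours (the XOR distance); each bucket holds at most K peers.
const K: usize = 20; // assumed max peers per bucket (a common Kademlia default)

/// Bucket index for `peer` relative to `local`; a higher index means a
/// greater XOR distance. Returns None when the IDs are identical.
fn bucket_index(local: u32, peer: u32) -> Option<u32> {
    let distance = local ^ peer;
    if distance == 0 {
        None
    } else {
        // Position of the highest set bit of the XOR distance.
        Some(31 - distance.leading_zeros())
    }
}

fn main() {
    let local = 0xAAAA_AAAAu32;
    for peer in [local ^ 1, local ^ 0b1000, !local] {
        println!(
            "peer {:08x} -> bucket {:?} (each bucket holds up to {} peers)",
            peer,
            bucket_index(local, peer),
            K
        );
    }
    // Half of a random ID space lands in the furthest bucket, a quarter in the
    // next, and so on, so the far buckets saturate at K while buckets close to
    // our own ID stay sparse; ~10 non-empty buckets holding ~150 peers in total
    // is consistent with K = 20.
}
```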
The ERROR-level messages that were parsed pertained to the SN_NETWORKING:MSG component, specifically inbound & outbound response failure messages. This seems expected given that network capacity was being reduced, and they also seem to occur right after the bucket and peer #s in the kBucket statistics panel start dropping.