ReplicationNet [June 7 Testnet 2023] [Offline]

What are the server specs you’re running your 30 nodes on? And what’s the cost? @Southside

3 Likes

I added a 20 GB volume as well for another 1.06 euro/month.

Next testnet I will play around with LVM to create a 4 GB root partition and 36 GB for chunks and logs.
Or I might just not bother and leave it putting logs on the original 20 GB root partition and storing the chunks on the 20 GB added volume.
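If I go the simpler route, here is a minimal sketch of putting the added volume under LVM and mounting it for chunks (the device name, volume group name and mount point are assumptions; check lsblk for the real device first):

```bash
# Minimal sketch: dedicate the attached 20 GB volume to chunk storage via LVM.
# /dev/sdb, safe-vg, chunks and /mnt/safe-chunks are placeholder names.
sudo pvcreate /dev/sdb                          # register the volume as an LVM physical volume
sudo vgcreate safe-vg /dev/sdb                  # create a volume group on it
sudo lvcreate -l 100%FREE -n chunks safe-vg     # one logical volume using all the space
sudo mkfs.ext4 /dev/safe-vg/chunks              # format it
sudo mkdir -p /mnt/safe-chunks
sudo mount /dev/safe-vg/chunks /mnt/safe-chunks # point the node's chunk dir here
```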

But let's see what the expectations are from the team for the next one. Either way, I'm happy with the price of a pint a month to join in the fun.

6 Likes

It will allow the network to stay alive for longer.
But at the same time it will be harder to find bugs which appear when the network is under high load.
Hard to say which is better. Maybe alternating between payment/no-payment networks?

4 Likes

How was it handling 30 nodes, RAM and CPU usage wise?

3 Likes

Quite possibly.
We should not be fearful of any “two steps forward, one step back” from the team as they feel their way now.
There may be some VERY specific short-running testnets just to try certain features/strategies.
We should not allow ourselves to get spoiled and expect every testnet to be a 100% success - that just won’t happen. But it’s all looking hopeful for a generally positive trajectory.

6 Likes

There are some performance graphs further up the thread.
It was breaking a sweat but handling it just fine from what I could see.
Next time, if I can get @shu’s dashboard working, I hope to be able to answer that question in greater detail.
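In the meantime, a rough way to eyeball it from the shell (a quick sketch, assuming the node processes are named safenode):

```bash
# Per-node CPU/RAM snapshot, heaviest first ("safenode" is the assumed process name)
ps -C safenode -o pid,%cpu,%mem,rss,etime,args --sort=-rss
# Total resident memory across all node processes, in MB
ps -C safenode -o rss= | awk '{sum+=$1} END {printf "%.0f MB total\n", sum/1024}'
```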

2 Likes

I think we all have dashboard envy. I’d like to know more about the setup @shu has created.

2 Likes

Just brought my node down as well.

Logs

Great job again!

8 Likes

Since there was a lot of discussion on Outgoing Connection Error in this post, I decided to split up the generic message by its inner message in the safenode logs, then re-parse, re-load, and update the dashboard below. I also added counts of replication addresses fetched from PeerIDs, and of network addresses marked as failed by the replication fetcher.
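For anyone who wants to reproduce a rough version of that breakdown from their own logs, here is a minimal sketch (the log directory and the exact error wording are assumptions on my part, so check one of your own files and adjust the patterns):

```bash
# Count Outgoing Connection Error lines per inner error type across all node logs.
# LOG_DIR and the grep patterns are placeholders -- adapt them to your setup.
LOG_DIR=/path/to/your/safenode/logs
grep -rhi "outgoing connection error" "$LOG_DIR" \
  | grep -oiE "connection refused|connection reset by peer|network (is )?unreachable|(operation )?timed out" \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn
```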

A few more observations:

  • There were a few types of network errors associated with Outgoing Connection Error with remote peer IDs:

    • Connection Refused
    • Connection Reset By Peer
    • Network Unreachable
    • Operation Timed Out
  • Dead peer detected and connection refused seem to rise and fall in tandem per hour over the timeline (2nd and 3rd panel above)

  • Operation timed out messages seem to not be related to Dead peer detected messages (3rd and 4th panel above)

  • Connection refused occurrences overall happened about 3x more often than operation timed out, making it the most common reason for an Outgoing Connection Error failing with a remote Peer ID

  • Overall, the number of network-related errors with Outgoing Connection Error is very low compared to the sheer number of connections established over the past 4 days, :smile:

Hopefully some of this helps the Maidsafe team in their deeper dive this upcoming week, :smiley: .

A bit later I may decide to dive a bit deeper into the Peer Connection closed messages, and try to classify the reasons for their failure based on the existing log messages.

15 Likes

that’s her dead in the water

ubuntu@safe:~$ safe files upload ./test1.txt 
Removed old logs from directory: "/tmp/safe-client"
Logging to directory: "/tmp/safe-client"
Current build's git commit hash: 0be2ef056215680b02ca8ec8be4388728bd0ce7c
🔗 Connected to the Network
Storing file "test1.txt" of 2 bytes..
Successfully stored file "test1.txt" to bc4bb29ce739b5d97007946aa4fdb987012c647b506732f11653c5059631cd3d
Writing 57 bytes to "/home/ubuntu/.safe/client/uploaded_files/file_names_2023-06-11_20-16-06"


ubuntu@safe:~$ safe files download test11.txt bc4bb29ce739b5d97007946aa4fdb987012c647b506732f11653c5059631cd3d
Removed old logs from directory: "/tmp/safe-client"
Logging to directory: "/tmp/safe-client"
Current build's git commit hash: 0be2ef056215680b02ca8ec8be4388728bd0ce7c
🔗 Connected to the Network
Downloading file "test11.txt" with address bc4bb29ce739b5d97007946aa4fdb987012c647b506732f11653c5059631cd3d
Did not get file "test11.txt" from the network! Network Error Record was not found locally.
ubuntu@safe-hamilton:~$ 

2 Likes

I have full confidence in the team; in my opinion, when it comes to testing, the guys know what they are doing.

The individual tests do not have to be done in order: if any part or piece of network functionality is ready, then it should be checked in tests, and when the time is right they will put it all together and we will test the operation of the entire SafeNet.

8 Likes

FWIW, below is this node’s interaction as seen from my node (I took the opportunity to refine the backend time series data on my end, so sorting and filtering against Peer IDs can be faster):

From my node’s perspective, this node seems to have been operating just fine for 48 hours, based on continuous peer connections, peer info received, etc.

My node detected this as a dead peer at 06-09-2023 18:06:33 UTC, roughly 70 seconds earlier, along with a connection refused related message at the same time.

It seems it may take a variable amount of time for dead peers to be detected by their neighbors, depending on the circumstances, as commented by Qi. I don’t see anything abnormal with that.

The host for that node may have crashed or rebooted; it’s the cloud… including multiple other droplets potentially provisioned on the same physical hardware… As with any cloud provider, the cloud comes with random hardware outages, network hiccups, etc. The process may not be up anymore, and there may not be any extra forensics left to gather if there are no further logs, :man_shrugging: .

Anyhow, I didn’t follow the dialogue above that closely regarding the interest in this specific node / dead peer, but I was curious about the scale / timespan (seconds, minutes, etc.) over which other nodes would also see the same target node as dead. An interesting data point to note.
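If anyone else kept their logs, here is a crude sketch for pulling the timestamps at which your own nodes flagged the same peer as dead, so the detection lag can be compared across nodes (the peer ID, log directory and "dead peer" wording are placeholders/assumptions on my part):

```bash
# Timestamps at which each local node flagged a given peer as dead.
# PEER, LOG_DIR and the "dead peer" log wording are placeholders -- adjust to your setup.
PEER="<peer-id-here>"
LOG_DIR=/path/to/your/safenode/logs
grep -rHi "dead peer" "$LOG_DIR" | grep -i "$PEER" | sort
```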

10 Likes

Not as yet. We’re still looking at why some nodes are quiet for so long.

That’s the limit!

All our droplets are almost at the limit too


I second all the other comments, @Shu, that’s amazing stuff. Super helpful and really, really cool to see!


We actually had a bug whereby data replication was causing connection closures (and a lot of msg dropping around replication): fix(replication): prevent dropped conns during replication by joshuef · Pull Request #369 · maidsafe/safe_network · GitHub

So it was indeed surprising, but most likely explained by that :+1:

Yup. We’re still digging into the why of this. It’s still no clearer to me this morning.

No, they should only (eventually) know about the closest nodes, and then know “less” about more distant parts of the network. There was no hard cap on the nodes known (beyond the kad table, which I guess does have an inherent cap, but I’m not sure what it is off the top of my head).
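For what it's worth, if the nodes are on rust-libp2p's Kademlia defaults (an assumption I'd want to double-check: 256 k-buckets, each holding at most K_VALUE = 20 peers), the inherent routing-table ceiling would work out to

256 buckets × 20 peers per bucket = 5,120 peers

though in practice only the buckets close to a node's own ID ever fill, so the number of peers actually known stays well below that.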


With nodes being full here, I’ll probably bring this down soon enough. I am going to have a bit of a go at updating a portion of the nodes and playing with some churn just now, though.

19 Likes

Glad to see you and the rest of the Maidsafe team are already looking into the areas that raised slight questions from the charts above, and are all on top of it! :smiley:

It is an amazing feeling to draw similar or the same conclusions as others, independently at first, and then to agree on them when discussing them openly in a collaborative environment, even more so when the same conclusion was reached but the path taken to get there was different.

e.g. Tests that run internally and publicly, etc.
e.g. Data that is visualized through logs, charts, console, etc.
e.g. A team that is composed of Maidsafe, the community here, and others who silently participate too, etc.

I am confident Maidsafe will address the next set of challenges ahead.

13 Likes

I can see we’ve lost a few nodes due to OOM. Probably due to the slow mem increase as outlined above in the convo between @qi_ma and @Shu.

We’ll be digging in there some, but you can expect this testnet to have more holes in it. (This may have been the event some folk noticed through a replication spike!?)

I’m still leaving it up just now, but just for folk to be aware!
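For anyone running their own droplets and wondering whether they were hit as well, a quick way to check for OOM kills (a rough sketch; adjust the time window to your setup):

```bash
# Look for kernel OOM-killer activity on this box
sudo dmesg -T | grep -iE "oom-killer|out of memory"
# Or, on systemd hosts, limited to roughly the testnet's lifetime
sudo journalctl -k --since "4 days ago" | grep -iE "oom-killer|out of memory"
```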

14 Likes

So it may be that the slow memory increase was the cause of the relatively fast memory increase?
Another possibility arises from the assumption that the memory increase is related to replications triggered by dead nodes.
So if someone first integrates 100 nodes into the network, then suddenly shuts them down, other nodes may shut down too, similar to what I said before:

3 Likes

I killed all my nodes.
For the record, here’s where they got to:

230M    /tmp/safenodedata/record_store
230M    /tmp/safenodedata
447M    /tmp/safenodedata2/record_store
447M    /tmp/safenodedata2
434M    /tmp/safenodedata3/record_store
434M    /tmp/safenodedata3
393M    /tmp/safenodedata4/record_store
393M    /tmp/safenodedata4
247M    /tmp/safenodedata5/record_store
247M    /tmp/safenodedata5
79M     /tmp/safenodedata6/record_store
79M     /tmp/safenodedata6
446M    /tmp/safenodedata7/record_store
446M    /tmp/safenodedata7
453M    /tmp/safenodedata8/record_store
453M    /tmp/safenodedata8
2.9M    /tmp/safenodedata9/record_store
2.9M    /tmp/safenodedata9
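Those record stores add up to roughly 2.7 GB across the nine nodes. If anyone wants the same listing with a grand total, du can produce it directly (paths as in the output above):

```bash
# Per-node record_store sizes plus a combined total
du -shc /tmp/safenodedata*/record_store
```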

Well done again Maidsafe, looking forward to the next round!
:love_you_gesture: :pray:

17 Likes

All my logs are on a suspended volume on Hetzner.
10 mins’ work to spin up a server, attach the volume and make them available – should I bother, or do the devs have enough logs to pore over as it is?

4 Likes

We should be fine

4 Likes

“We should be fine” is the correct answer - as I would prefer to be out the back with a beer :slight_smile:

2 Likes