What are the server specs you’re running your 30 nodes on? And what’s the cost? @Southside
I added a 20GB volume as well for another 1.06 euro/month.
Next testnet I will play around with LVM to create a 4GB root partition and 36GB for chunks and logs.
Or I might just not bother and leave it putting logs on the original 20GB root partition and storing the chunks on the 20GB added volume.
But let’s see what the team’s expectations are for the next one. I’m happy with the price of a pint/month to join in the fun.
It will allow the network to stay alive for longer.
But at the same time it will be harder to find bugs which only appear when the network is under high load.
Hard to say what is better. Maybe alternating between payment/no-payment networks?
How was it handling 30 nodes, RAM and CPU usage wise?
Quite possibly.
We should not be fearful of any two steps forward, one step back from the team as they feel their way forward now.
There may be some VERY specific short-running testnets just to try certain features/strategies.
We should not allow ourselves to get spoiled and expect every testnet to be a 100% success - that just won’t happen. But it’s all looking hopeful for a generally positive trajectory.
There are some performance graphs further up the thread.
It was breaking sweat but handling it just fine from what I could see.
Next time, if I can get to use @shu’s dashboard, I hope to be able to answer that question in greater detail.
Think we all have dashboard envy. I’d like to know more about the setup @shu has created.
Since there was a lot of discussion on Outgoing Connection Error in this post, I decided to split up the generic message based on its inner message in the safenode logs, then re-parse, re-load, and update the dashboard below. I also added counts of fetching replication addresses from PeerIDs, and of the replication fetcher marking failed network addresses.
Few more observations:
- There were a few types of network errors associated with Outgoing Connection Error with remote peer IDs:
  - Connection Refused
  - Connection Reset By Peer
  - Network Unreachable
  - Operation Timed Out
- Dead peer detected and connection refused seem to rise and fall in tandem per hour over the timeline (2nd and 3rd panel above)
- Operation timed out messages seem not to be related to Dead peer detected messages (3rd and 4th panel above)
- Connection refused occurrences overall happened 3x more than operation timed out, as the most common reason for the Outgoing Connection Error failing with a remote Peer ID
- Overall, the amount of network related errors with Outgoing Connection Error is very low compared to the sheer number of connections established over the past 4 days.
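For anyone wanting to try a similar breakdown on their own logs, here is a minimal sketch of splitting the generic Outgoing Connection Error message by its inner message. The four categories come from the observations above; the sample log lines and the function names are made up for illustration, and real safenode log lines will almost certainly be formatted differently.

```python
from collections import Counter
from typing import Iterable, Optional

# Inner error messages listed in the observations above (matched case-insensitively).
CATEGORIES = (
    "connection refused",
    "connection reset by peer",
    "network unreachable",
    "operation timed out",
)


def classify(line: str) -> Optional[str]:
    """Return the inner category of an Outgoing Connection Error line, else None."""
    lowered = line.lower()
    if "outgoing connection error" not in lowered:
        return None
    for category in CATEGORIES:
        if category in lowered:
            return category
    return "other"


def count_categories(lines: Iterable[str]) -> Counter:
    """Count Outgoing Connection Error lines per inner category."""
    counts = Counter()
    for line in lines:
        category = classify(line)
        if category is not None:
            counts[category] += 1
    return counts


# Hypothetical sample lines, NOT real safenode output.
SAMPLE = [
    "OutgoingConnectionError peer=A: Connection refused (os error 111)",
    "OutgoingConnectionError peer=B: Connection refused (os error 111)",
    "OutgoingConnectionError peer=C: Connection refused (os error 111)",
    "OutgoingConnectionError peer=D: Operation timed out",
    "OutgoingConnectionError peer=E: Connection reset by peer",
    "Dead peer detected: F",
]


def normalise(line: str) -> str:
    """Insert spaces into the camel-cased tag so the phrase match works."""
    return line.replace("OutgoingConnectionError", "Outgoing Connection Error")


COUNTS = count_categories(normalise(line) for line in SAMPLE)
```

The counts can then be bucketed per hour and charted; that part depends entirely on the timestamp format in the logs, so it is left out here.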
Hopefully, some of this helps or assists Maidsafe’s team in their deeper dive this upcoming week.
I may a bit later decide to dive a bit deeper into the Peer Connection closed messages, and try to classify the reason for their failure based on the existing log messages.
that’s her dead in the water
ubuntu@safe:~$ safe files upload ./test1.txt
Removed old logs from directory: "/tmp/safe-client"
Logging to directory: "/tmp/safe-client"
Current build's git commit hash: 0be2ef056215680b02ca8ec8be4388728bd0ce7c
🔗 Connected to the Network
Storing file "test1.txt" of 2 bytes..
Successfully stored file "test1.txt" to bc4bb29ce739b5d97007946aa4fdb987012c647b506732f11653c5059631cd3d
Writing 57 bytes to "/home/ubuntu/.safe/client/uploaded_files/file_names_2023-06-11_20-16-06"
ubuntu@safe:~$ safe files download test11.txt bc4bb29ce739b5d97007946aa4fdb987012c647b506732f11653c5059631cd3d
Removed old logs from directory: "/tmp/safe-client"
Logging to directory: "/tmp/safe-client"
Current build's git commit hash: 0be2ef056215680b02ca8ec8be4388728bd0ce7c
🔗 Connected to the Network
Downloading file "test11.txt" with address bc4bb29ce739b5d97007946aa4fdb987012c647b506732f11653c5059631cd3d
Did not get file "test11.txt" from the network! Network Error Record was not found locally.
ubuntu@safe-hamilton:~$
I have full confidence in the team. In my opinion, when it comes to testing, the guys know what they are doing.
Residual tests do not have to be done in order. If any part or functionality of the network is ready, it should be checked in tests, and when the time is right they will put it all together and we will test the operation of the entire SafeNet.
FWIW, below is the interaction of this node from my node (I took the opportunity to refine the backend time series data on my end, so sorting and filtering could be faster against Peer IDs):
This node seems to have been operating just fine for 48 hours, based on continuous peer connections, peer info received, etc. from my node’s perspective.
My node detected this as a dead peer at 06-09-2023 18:06:33 UTC, roughly 70 seconds earlier, along with a connection refused related message at the same time.
It seems it may take a certain amount of variable time for dead peers to be detected by their neighbors, depending on the circumstances as commented by Qi. I don’t see anything abnormal with that.
The host for that node may have crashed or rebooted. It’s the cloud… including multiple other droplets potentially provisioned on the same physical hardware… As with any cloud provider, cloud comes with random hardware outages, network hiccups, etc. The process may not be up anymore, and there may not be any extra forensics left to gather if there are no further logs.
Anyhow, I didn’t follow the dialogue above that closely regarding the interest in this specific node / dead peer, but I was curious about what scale / timespan (seconds, minutes, etc.) other nodes would see the same target node as dead as well. Interesting data point to note.
Not as yet. We’re still looking at why some nodes are quiet for so long.
That’s the limit!
All our droplets are almost at the limit too.
I second all the other comments. @Shu, that’s amazing stuff. Super helpful and really really cool to see!
We actually had a bug whereby data replication was causing connection closures (and a lot of msg dropping around replication): fix(replication): prevent dropped conns during replication by joshuef · Pull Request #369 · maidsafe/safe_network · GitHub
So it was indeed surprising, but most likely explained by that.
Yup. We’re still digging into the why of this. It’s still no clearer to me this morning.
No, they should only (eventually) know about the closest nodes, and then know “less” about more distant parts of the network. There was no hard cap on nodes known (beyond the kad table, which I guess does have an inherent cap, but I’m not sure what it is off the top of my head).
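To illustrate the “know less about more distant parts” idea, here is a toy sketch of how a Kademlia-style routing table buckets peers by XOR distance. It is not the safe_network or libp2p implementation: the 8-bit IDs, the bucket size `K = 4` and the “first come, first kept” rule are all assumptions chosen to keep the example small, and real tables use far wider keys and smarter eviction.

```python
ID_BITS = 8  # toy key width (assumption); real keys are much wider
K = 4        # toy per-bucket capacity (assumption)


def bucket_index(own_id: int, peer_id: int) -> int:
    """Bucket for a peer: position of the highest bit where the two IDs differ."""
    distance = own_id ^ peer_id
    if distance == 0:
        raise ValueError("a node does not store itself")
    return distance.bit_length() - 1


def fill_table(own_id: int, peer_ids, k: int = K) -> dict:
    """Place peers into buckets, keeping at most k peers per bucket."""
    table = {}
    for peer_id in peer_ids:
        if peer_id == own_id:
            continue
        bucket = table.setdefault(bucket_index(own_id, peer_id), [])
        if len(bucket) < k:  # simplification: a full bucket just ignores newcomers
            bucket.append(peer_id)
    return table


# Offer every possible peer in the toy 8-bit ID space to node 0.
TABLE = fill_table(0, range(2 ** ID_BITS))
KNOWN = sum(len(bucket) for bucket in TABLE.values())
```

In this toy run the farthest bucket covers 128 possible IDs but keeps only 4 of them, while the nearest buckets keep every peer they see. Under these assumptions the inherent cap is at most `K` peers per bucket times the number of bits in the key.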
With nodes being full here, I’ll probably bring this down soon enough. I am going to have a bit of a go at some network updating of a portion of the nodes and playing with some churn just now though.
Glad to see that you and the rest of the Maidsafe team are already looking into the areas that raised slight questions from the charts above, and are all on top of it!
It is an amazing feeling to reach the same or similar conclusions as others, independently at first, and then agree on them when discussing openly in a collaborative environment, even more so when the conclusion was the same but the path taken to it was different.
e.g. Tests that run internally and publicly.
e.g. Data that is visualized through logs, charts, console, etc.
e.g. A team that is composed of Maidsafe, the community here, and others who silently participate too.
I am confident Maidsafe will address the next set of challenges ahead.
I can see we’ve lost a few nodes due to OOM. Probably due to the slow mem increase as outlined above in convo between @qi_ma and @Shu.
We’ll be digging in there some, but you can expect this testnet to have more holes in it. (This may have been the event some folk noticed through a replication spike!?)
I’m still leaving it up just now, but just for folk to be aware!
So it may be that the slow memory increase was the cause of a relatively fast memory increase?
Another possibility arises from the assumption that memory increase is related to replications triggered by dead nodes.
So if someone first integrates 100 nodes into the network, then suddenly shuts them down, other nodes may shut down too, similar to what I said before:
I killed all my nodes.
For the record, here’s where they got to:
230M /tmp/safenodedata/record_store
230M /tmp/safenodedata
447M /tmp/safenodedata2/record_store
447M /tmp/safenodedata2
434M /tmp/safenodedata3/record_store
434M /tmp/safenodedata3
393M /tmp/safenodedata4/record_store
393M /tmp/safenodedata4
247M /tmp/safenodedata5/record_store
247M /tmp/safenodedata5
79M /tmp/safenodedata6/record_store
79M /tmp/safenodedata6
446M /tmp/safenodedata7/record_store
446M /tmp/safenodedata7
453M /tmp/safenodedata8/record_store
453M /tmp/safenodedata8
2.9M /tmp/safenodedata9/record_store
2.9M /tmp/safenodedata9
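A quick way to total a listing like this is sketched below. It assumes the figures came from something like `du -h`, where `M` means MiB, and it skips the `record_store` lines so each node is only counted once. The function names are just for illustration.

```python
# Multipliers to convert a du -h style suffix into MiB (assumes powers of 1024).
UNITS_IN_MIB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}

DU_OUTPUT = """\
230M /tmp/safenodedata/record_store
230M /tmp/safenodedata
447M /tmp/safenodedata2/record_store
447M /tmp/safenodedata2
434M /tmp/safenodedata3/record_store
434M /tmp/safenodedata3
393M /tmp/safenodedata4/record_store
393M /tmp/safenodedata4
247M /tmp/safenodedata5/record_store
247M /tmp/safenodedata5
79M /tmp/safenodedata6/record_store
79M /tmp/safenodedata6
446M /tmp/safenodedata7/record_store
446M /tmp/safenodedata7
453M /tmp/safenodedata8/record_store
453M /tmp/safenodedata8
2.9M /tmp/safenodedata9/record_store
2.9M /tmp/safenodedata9
"""


def size_in_mib(text: str) -> float:
    """Convert a size such as '230M' or '2.9M' into MiB."""
    return float(text[:-1]) * UNITS_IN_MIB[text[-1]]


def total_mib(du_output: str) -> float:
    """Sum the per-node directory sizes, ignoring the nested record_store lines."""
    total = 0.0
    for line in du_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        size, path = parts
        if path.endswith("/record_store"):
            continue  # already included in the parent directory's figure
        total += size_in_mib(size)
    return total


TOTAL = total_mib(DU_OUTPUT)
print(f"{TOTAL:.1f} MiB across 9 nodes")
```

For the nine nodes above that comes to roughly 2.7 GiB in total, with a wide spread between the fullest node and the emptiest one.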
Well done again Maidsafe, looking forward to the next round!
All my logs are on a suspended volume on Hetzner.
10 mins work to spin up a server, attach the volume and make them available – should I bother or do the devs have enough logs to pore over as it is?
We should be fine
is the correct answer - as I would prefer to be out the back with a beer