It is doing what it should and, at this stage, losing data correctly. What I mean is this:
Nodes are capped
Once a node is full it will store no more (uploads should fail, but some may not)
A full node cannot replace another node under churn (as it cannot store what that node stored), i.e. at this stage we have data loss
Wee bit of a deeper dive
Maybe fine (but how do we make that cap dynamic across the network, or how do we handle nodes of differing sizes?)
As nodes get near full, the design is that the farming rate goes up, encouraging more nodes to join (this is key)
So without 'pay for data' (likely next testnet) we have free uploads to a limited disk, it will fill and in our case lose data.
Right now we could actually have nodes delete data they are no longer responsible for. That would mean the network cannot store more, but it does not lose any data, as nodes can replace other nodes (more or less; if the network shrinks then this still loses data). This is another reason for very large networks: small networks have much larger distances between nodes, and those distances are much more unbalanced, so again data loss due to the differing storage requirements within a close group. Larger networks' close groups will be much more similar; small networks are all over the shop in space requirements.
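To make the cap behaviour above concrete, here is a minimal sketch, assuming a simple in-memory map and the 1024-chunk cap used in this testnet; the names (`Node`, `try_store`, `prune_out_of_range`) are illustrative, not the actual safenode code. A full node refuses new chunks (uploads should fail), its fullness is the signal the farming rate could rise with, and the optional pruning of out-of-range chunks is what would keep it able to replace other nodes under churn, at the cost of total capacity.

```rust
use std::collections::BTreeMap;

// The per-node cap used in this testnet (the 1024-chunk limit mentioned later).
const MAX_CHUNKS: usize = 1024;

// Hypothetical node store, keyed by chunk address (XOR-space name).
struct Node {
    store: BTreeMap<[u8; 32], Vec<u8>>,
}

impl Node {
    /// A full node simply refuses new chunks; this is the "uploads should fail" case.
    fn try_store(&mut self, addr: [u8; 32], chunk: Vec<u8>) -> Result<(), &'static str> {
        if self.store.len() >= MAX_CHUNKS {
            return Err("node full: cannot store this chunk");
        }
        self.store.insert(addr, chunk);
        Ok(())
    }

    /// How close to the cap we are; in the design above, the farming rate
    /// would rise as this approaches 1.0 to attract new nodes.
    fn fullness(&self) -> f64 {
        self.store.len() as f64 / MAX_CHUNKS as f64
    }

    /// The optional pruning idea: drop chunks this node is no longer among the
    /// closest nodes for, keeping headroom to replace other nodes under churn.
    fn prune_out_of_range(&mut self, is_still_responsible: impl Fn(&[u8; 32]) -> bool) {
        self.store.retain(|addr, _| is_still_responsible(addr));
    }
}
```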
I hope this helps everyone get into the weeds here. We are approaching the critical part where we stitch together the parts and this farming algorithm will be tweaked (mostly as we will be using free coins and again, mega uploads etc. etc. etc.). This tweaking part will be key, and the network should always break until all the parts are stitched together. Otherwise, why do all the parts?
Interesting days, this should have been us in 2015, but the tech was just not there. It is now mostly (I still hanker for QUIC hole punch at the network stack and perhaps BLS keys with added quantum protection throughout).
the network should and must always break until all the parts are stitched together
This is the most important thing to realise right now
^ does this large post mean that nodes should crash?
If 'yes', then which not-yet-stitched parts are responsible for it?
If 'no', then where is 12D3KooWHs2FuFcuSHtkt1KdCAKDnXp35EDbkrcx559rp7TMrj9n now?
And if it crashed, then why, and how many other nodes have crashed as well?
Do we have the method for paying for data fully determined yet? Is it the system worked out previously, where the client gets a 'quote' for storing 'N' chunks and, if accepted/paid for, the client can upload the 'N' chunks anytime afterwards?
Is this still the plan?
If so, then I wonder if there could be an issue if a large amount of data has been paid for but the upload is delayed until the network is getting full. They still have the ability to upload, making it fuller.
I realise for a very large network this should not be an issue, since the amount of uploads delayed by a long time will be minimal, even if a bad actor tries to game it. BUT for a new/small/medium network, or during a massive outage of, say, China+India, can this be a problem?
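Purely as an illustration of the quote-then-upload flow being asked about (these types and fields are hypothetical, not a confirmed design): the concern is a receipt paid long ago that can still push data into an already-fuller network, and an expiry or re-quote is one possible mitigation.

```rust
// Hypothetical sketch only; none of these types exist in the codebase as written here.
struct Quote {
    n_chunks: usize,       // how many chunks the quote covers
    price_per_chunk: u64,  // price locked in at quote time
    expires_at: u64,       // unix timestamp: one way to bound "pay now, upload much later"
}

struct PaidReceipt {
    quote: Quote,
    remaining_uploads: usize,
}

impl PaidReceipt {
    /// Without the expiry check, a long-delayed upload is still valid even if
    /// the network has become much fuller since the quote was paid.
    fn can_upload(&self, now: u64) -> bool {
        self.remaining_uploads > 0 && now <= self.quote.expires_at
    }
}
```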
I could say it more or less is, but this is the point where we will all chime in. So we need to get a pay-for-data testnet in some form, and then we can all tweak that. It will be quite deep and a risk for sure, but it's a key part we cannot knee-jerk.
It means the network is incomplete and will fail and do so in different ways. What we want to make sure is the failure is not bugs/crashes etc.
Don't know what you mean here. We cannot cobble together any form of the network that is incomplete and keeps working. You could do some bad centralised things, but that tells us very little.
If the total amount of tokens corresponds to the initial network capacity, then the network theoretically can't be full (assuming the initial nodes are alive, of course).
Looks like that's what's planned, at least for test networks.
If it does not correspond, then I don't see much difference from how tests are performed right now - there will be no protection from overflow.
Mainly because I am answering you correctly. If you think we design networks to crash, kill children, blow up a bus load of nuns and split the planet's core, we don't.
You talk of nodes crashing as if it sounds quite exciting, when in actual fact it just shows your desire for disaster.
They can always crash; given enough time your input would also be positive. It's all just possibilities really.
Did you forget the node operators are paid from this, and then they can pay for more? And during all this, the other 70% is being slowly dispersed. Even the 30% to the foundation will eventually* be spent to upload data, so in effect there will be an unlimited amount of SNT to pay for data, just not all available at once.
My desire is to find root causes of problems as early as possible.
And not when 'all parts are stitched together'.
To me it looks like a natural development process.
Reluctance to do so is what looks like the origin of disaster to me.
Not bugs/crashes by themselves - they are a normal part of every development.
Yes.
It means then that the introduction of payments will just slow down the filling of nodes. And at the end, they will still fill up.
For sure, a node crash could be one of the reasons.
It's just other stuff as well, like the original holder being a temp one that got restarted, or the original holder getting pushed out of range and hence no longer being part of the close range.
Combined with the other situations I mentioned previously, these can all cause the 8 copies to vanish.
And they all will be investigated.
A detected dead peer triggers replication among nodes, which will consume more memory during that period, and may also run into the data loss during transmission that was mentioned.
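A rough sketch of that trigger, with stand-in types rather than the real safenode/libp2p API, to show why a burst of dead-peer detections means a burst of memory and traffic:

```rust
const CLOSE_GROUP_SIZE: usize = 8;

type PeerId = u64;    // stand-in for a libp2p PeerId
type RecordKey = u64; // stand-in for a chunk/record address

trait Network {
    /// Was `peer` one of the CLOSE_GROUP_SIZE closest nodes to `key`?
    fn was_in_close_group(&self, peer: PeerId, key: RecordKey) -> bool;
    /// The current closest peers to `key`.
    fn closest_peers(&self, key: RecordKey, n: usize) -> Vec<PeerId>;
    /// One-to-one replication push (a single PUT per chunk, as noted below).
    fn replicate_to(&self, target: PeerId, key: RecordKey);
}

fn on_peer_dead(dead_peer: PeerId, my_records: &[RecordKey], network: &dyn Network) {
    for &key in my_records {
        // Only records the dead peer was responsible for need a fresh copy,
        // but each one costs memory and traffic while it is in flight.
        if network.was_in_close_group(dead_peer, key) {
            for target in network.closest_peers(key, CLOSE_GROUP_SIZE) {
                network.replicate_to(target, key);
            }
        }
    }
}
```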
I assume this is from the log you shared at ReplicationNet [June 7 Testnet 2023] [Offline] - #163 by Vort, which shows a total of 127 dead_peer detections during an 8-hour period.
Not sure if that's a reasonable number of dead_peers detected, but it does cause some oscillation in memory usage.
Regarding that particular node, I don't have the full log of it so far, so I can't tell what happened to it.
Maybe a normal temp-hosted node just got restarted, or maybe it hung like the other nodes reported as not having a growing log, or anything else.
Anyway, will keep an eye on it when possible.
1st, a full node still consumes memory/traffic when accepting chunk copies from store/replicate. It's just that it won't create a chunk file on disk.
2nd, the total_mb_read collected is still a bit mysterious to me; as Shu reported, there is only 4 KB read against hundreds of MB written. It may be explainable, I just don't get it yet.
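A minimal sketch of the 1st point, assuming a simple file-per-chunk store and the 1024-file cap (not the actual implementation): the chunk bytes are already in memory when the cap check happens, so the only thing a full node saves is the write to disk.

```rust
use std::fs;
use std::path::Path;

const MAX_CHUNK_FILES: usize = 1024;

fn handle_incoming_chunk(store_dir: &Path, addr_hex: &str, bytes: Vec<u8>) -> std::io::Result<()> {
    // The chunk has already been received and deserialised into `bytes`,
    // so the memory/traffic cost has been paid whether we are full or not.
    let existing = fs::read_dir(store_dir)?.count(); // naive count, fine for a sketch
    if existing >= MAX_CHUNK_FILES {
        // Full: accept the message but do not create a chunk file.
        return Ok(());
    }
    fs::write(store_dir.join(addr_hex), bytes)
}
```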
Hi @Shu, that's really a fab set of statistics/diagrams, done with in-depth observations.
Great work, and really appreciated. Thx a lot for the effort.
When storing a chunk, a broadcast is used among the CLOSE_GROUP_SIZE (8) nodes.
i.e. it's expected you will receive 8 PUT requests.
For sure we will filter out duplicated ones, hence only one chunk file will be created.
Meanwhile, replication is a one-to-one request, i.e. one PUT request for one chunk file to be created.
So, overall, the ratio of PUT requests vs. chunk files created will be a number lower than 8, and how close to 8 it is depends on how much replication is involved.
Could be more, but it should be a close number, as I understand it.
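As a small worked example of that ratio (the chunk counts below are illustrative only): stored chunks arrive as a broadcast of 8 PUTs that create one file, replicated chunks arrive as one PUT per file, so the more replication a node handles, the further its PUTs-per-file ratio drops below 8.

```rust
const CLOSE_GROUP_SIZE: usize = 8;

fn puts_per_chunk_file(stored_chunks: usize, replicated_chunks: usize) -> f64 {
    let puts = CLOSE_GROUP_SIZE * stored_chunks + replicated_chunks; // 8 per store, 1 per replicate
    let files = stored_chunks + replicated_chunks;                   // one file either way
    puts as f64 / files as f64
}

fn main() {
    // e.g. 970 chunks stored directly and 30 received via replication gives
    // (8*970 + 30) / 1000 ≈ 7.8, in line with the ratio reported later in the thread.
    println!("{:.1}", puts_per_chunk_file(970, 30));
}
```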
It has the same IP as mentioned in the 1st post, 165.232.106.150. I thought that no restarts were planned for those nodes.
Looks fine for my node: [2023-06-11T12:03:48.236602Z TRACE sn_logging::metrics] {"physical_cpu_threads":4,"system_cpu_usage_percent":44.088863,"system_total_memory_mb":8461.066,"system_memory_used_mb":5606.904,"system_memory_usage_percent":66.267105,"network":null,"process":{"cpu_usage_percent":0.93167704,"memory_used_mb":155.50874,"bytes_read":0,"bytes_written":9564,"total_mb_read":405.95673,"total_mb_written":946.9618}}
So I think that it is more likely that something went wrong with Shu's node than that the total_mb_read calculation is broken.
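For anyone wanting to compare these figures across nodes, here is a small sketch (assuming the serde_json crate) that pulls total_mb_read and total_mb_written out of a sn_logging::metrics line like the one above:

```rust
use serde_json::Value;

/// Returns (total_mb_read, total_mb_written) from one metrics TRACE line, if present.
fn parse_metrics_line(line: &str) -> Option<(f64, f64)> {
    // The JSON payload starts after the "] " that closes the log prefix.
    let json = line.splitn(2, "] ").nth(1)?;
    let v: Value = serde_json::from_str(json).ok()?;
    let process = v.get("process")?;
    Some((
        process.get("total_mb_read")?.as_f64()?,
        process.get("total_mb_written")?.as_f64()?,
    ))
}
```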
Np, it was a learning exercise digging into the safenode logs, specific container stats aggregation, the back-end time series DB, and the various other tools required to do all the plumbing to have the data ingested in real time from the true source, i.e. the safenode logs, all the way up to the frontend dashboard.
Outside of that, I was very interested to see what insights the data would reveal, and whether it would help your team and the community further confirm, identify bottlenecks in, and validate the design decisions made thus far.
Below are some further questions and observations, open to you, the Maidsafe team, as well as the community here, for further discussion and commentary if desired:
Did the distribution of message type counts (%) across the categories specified in the images here surprise anyone in terms of how a single node behaved in steady state over a contiguous period?
I hope my node was not an outlier in terms of the pattern observed, as I would expect many individual nodes to behave in a similar pattern, so digging deep over the timeline of a single node or so seemed like a good idea.
On further review, for what it's worth, on a per-hour basis on the charts above, whenever there was 1+ dead peer detected message, there was also at least 1+ Outbound Connection Error in the same 1-hour time window, but not vice versa for other time ranges.
I don't know much about the inner workings of libp2p etc., and maybe it's something with my node, but I am still surprised by the level of connection closed and connection connected messages here, as a majority of the peers, 2000 (provided by Maidsafe) out of ~2150+, likely just continued to stay up and available in the first 3-4 days without issues. But this might all be okay (short vs long-lived persistent connections between peer nodes and their close groups).
I guess: is the above in line with the current design and expectations, or do folks find it interesting as a statistic as is, but nothing of obvious concern at this stage?
The ratio is indeed around 7.8x for PUT requests to Chunk Writes for my node, so it falls in line with your explanation above. Thanks!
Just saw that David mentioned earlier in this topic that some of the throughput has to do with certain timeouts that shouldn't exist, and that it's a work in progress there, great!
Did this surprise anyone, especially in periods when the node itself was not asked to store a PUT request, yet logs showed peer connections closed and connected numerous times in steady state against a set of unique peerIDs, which happened to be in the 100s+ per peerID in the first 24 hours?
If it's in line with expectations, all good from my end, though I may want to dig deeper into the reasons behind the connections closed inner messages here to help me better understand the why, hmm.
How would one have thought this would play out for both the connections closed and connections connected distributions on the histogram, given 90% of the nodes were up in the first 3 days without disk space concerns or the 1024-chunk limit being hit yet (assuming here that Maidsafe's nodes were healthy)?
Left-skewed, right-skewed, or a normal distribution? And why?
I am not sure what payload this is carrying, but it seems reasonable that if load is to be distributed across the network, incoming vs outgoing for a certain type of data should be a near 50/50 split, otherwise it's a one-way flood storm?
Seeing that the ratio is nearly 1:1, is this in-line with the current design and expectations?
Always nice to see numbers add up or equal the expected outcome here. No further comments here.
I was curious here whether any single node ends up actually seeing most, if not all, of the network's peerID addresses via the address discovery phase etc. over time. You mentioned a cache regarding the routing tables etc., and it being uncapped currently; is a node expected to discover nearly 90%+ of the peerIDs on the network?
I assume churn was very low in the first 24 to 48 hrs since 90% of the nodes continued to stay up and healthy etc.
Overall, I am extremely delighted with the outcome of this testnet in terms of the stability, just like many others!
I thank you for your time in providing the explanations earlier, and for helping me better understand the Safe Network architecture!
And if I raised too many questions in one go, I apologize, as I know your team is busy and it's the weekend too.
Although I really wanted to, I did not manage to take part in the network testing (but everything is ahead of me ;))
However, I have reviewed the entire thread and the statements of the testers show that the network is making great progress!
Big congratulations and thanks to everyone: the MS team, the testers and the supporters!!
Yes I have snapshotted my server and closed down my 30 nodes.
I'll wait for some kind of summary from the team, but I think this has been generally a huge success.
Let's see what a couple of days poring over logs brings - then it's on to the next testnet.
Someone said DBCNet - dunno if we are 100% ready for that just now but I will be delighted to be proved wrong.
Thanks to all who made this possible and all who participated.