It is doing what it should and, at this stage, losing data correctly. What I mean is this:
Nodes are capped
Once a node is full it will store no more (uploads should fail, but some may not)
A full node cannot replace another node under churn (as it cannot store what that node stored), i.e. at this stage we have data loss
Wee bit of a deeper dive
Maybe fine (but how do we make that cap dynamic across the network, or how do we handle nodes of differing sizes?)
As nodes get near full, the design is that the farming rate goes up, encouraging more nodes to join (this is key)
So without 'pay for data' (likely next testnet) we have free uploads to a limited disk, it will fill and in our case lose data.
Right now we could actually have nodes delete data they are no longer responsible for. That would mean the network cannot store more, but it does not lose any data, as nodes can replace other nodes (more or less; if the network shrinks then this still loses data). This is another reason for very large networks: small networks have much larger distances between nodes, and those distances are much more unbalanced, so again data loss due to the differing storage requirements within a close group. Larger networks' close groups will be much more similar; small networks are all over the shop in space requirements.
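To make the cap behaviour above concrete, here is a minimal sketch, assuming a simple in-memory map and the 1024-chunk cap used in this testnet; the names (`Node`, `try_store`, `prune_out_of_range`) are illustrative, not the actual safenode code. A full node refuses new chunks (uploads should fail), its fullness is the signal the farming rate could rise with, and the optional pruning of out-of-range chunks is what would keep it able to replace other nodes under churn, at the cost of total capacity.

```rust
use std::collections::BTreeMap;

// The per-node cap used in this testnet (the 1024-chunk limit mentioned later).
const MAX_CHUNKS: usize = 1024;

// Hypothetical node store, keyed by chunk address (XOR-space name).
struct Node {
    store: BTreeMap<[u8; 32], Vec<u8>>,
}

impl Node {
    /// A full node simply refuses new chunks; this is the "uploads should fail" case.
    fn try_store(&mut self, addr: [u8; 32], chunk: Vec<u8>) -> Result<(), &'static str> {
        if self.store.len() >= MAX_CHUNKS {
            return Err("node full: cannot store this chunk");
        }
        self.store.insert(addr, chunk);
        Ok(())
    }

    /// How close to the cap we are; in the design above, the farming rate
    /// would rise as this approaches 1.0 to attract new nodes.
    fn fullness(&self) -> f64 {
        self.store.len() as f64 / MAX_CHUNKS as f64
    }

    /// The optional pruning idea: drop chunks this node is no longer among the
    /// closest nodes for, keeping headroom to replace other nodes under churn.
    fn prune_out_of_range(&mut self, is_still_responsible: impl Fn(&[u8; 32]) -> bool) {
        self.store.retain(|addr, _| is_still_responsible(addr));
    }
}
```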
I hope this helps everyone get into the weeds here. We are approaching the critical part where we stitch together the parts and this farming algorithm will be tweaked (mostly as we will be using free coins and again, mega uploads etc. etc. etc.). This tweaking part will be key, and the network should always break until all the parts are stitched together. Otherwise, why do all the parts?
Interesting days, this should have been us in 2015, but the tech was just not there. It is now mostly (I still hanker for QUIC hole punch at the network stack and perhaps BLS keys with added quantum protection throughout).
the network should and must always break until all the parts are stitched together
This is the most important thing to realise right now
^ does this large post mean that nodes should crash?
If 'yes', then which not-yet-stitched parts are responsible for it?
If 'no', then where is 12D3KooWHs2FuFcuSHtkt1KdCAKDnXp35EDbkrcx559rp7TMrj9n now?
And if it crashed, then why, and how many other nodes have crashed as well?
Do we have the method for paying for data fully determined yet? Is it the system worked out previously, where the client gets a 'quote' for storing 'N' chunks and, if accepted/paid for, the client can upload the 'N' chunks anytime afterwards?
Is this still the plan?
If so, then I wonder if there could be an issue if a large amount of data has been paid for but the upload is delayed until the network is getting full. They still have the ability to upload, making it fuller.
I realise for a very large network this should not be an issue, since the amount of uploads delayed by a long time will be minimal, even if a bad actor tries to game it. BUT for a new/small/medium network, or during a massive outage of, say, China+India, can this be a problem?
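Purely as an illustration of the quote-then-upload flow being asked about (these types and fields are hypothetical, not a confirmed design): the concern is a receipt paid long ago that can still push data into an already-fuller network, and an expiry or re-quote is one possible mitigation.

```rust
// Hypothetical sketch only; none of these types exist in the codebase as written here.
struct Quote {
    n_chunks: usize,       // how many chunks the quote covers
    price_per_chunk: u64,  // price locked in at quote time
    expires_at: u64,       // unix timestamp: one way to bound "pay now, upload much later"
}

struct PaidReceipt {
    quote: Quote,
    remaining_uploads: usize,
}

impl PaidReceipt {
    /// Without the expiry check, a long-delayed upload is still valid even if
    /// the network has become much fuller since the quote was paid.
    fn can_upload(&self, now: u64) -> bool {
        self.remaining_uploads > 0 && now <= self.quote.expires_at
    }
}
```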
I could say it more or less is, but this is the point where we will all chime in. So we need to get a pay-for-data testnet in some form, and then we can all tweak that. It will be quite deep and a risk for sure, but it's a key part we cannot knee-jerk.
It means the network is incomplete and will fail and do so in different ways. What we want to make sure is the failure is not bugs/crashes etc.
Don't know what you mean here. We cannot cobble together any form of the network that is incomplete and keeps working. You could do some bad centralised things, but that tells us very little.
If the total amount of tokens corresponds to the initial network capacity, then the network theoretically can't be full (assuming the initial nodes are alive, of course).
Looks like that's what's planned, at least for test networks.
If it does not correspond, then I don't see much difference from how tests are performed right now - there will be no protection from overflow.
Mainly because I am answering you correctly. If you think we design networks to crash, kill children, blow up a bus load of nuns and split the planet's core, we don't.
You talk of nodes crashing as if it sounds quite exciting, when in actual fact it just shows your desire for disaster.
They can always crash; given enough time your input would also be positive. It's all just possibilities really.
Did you forget the node operators are paid from this, and then they can pay for more? And during all this, the other 70% is being slowly dispersed. Even the 30% to the foundation will eventually* be spent to upload data, so in effect there will be an unlimited amount of SNT to pay for data, just not all available at once.
My desire is to find root causes of problems as early as possible.
And not when 'all parts are stitched together'.
To me it looks like a natural development process.
Reluctance to do so is what looks like the origin of disaster to me.
Not bugs/crashes by themselves - they are a normal part of every development.
Yes.
It means then that the introduction of payments will just slow down the filling of nodes. And at the end, they will still fill up.
For sure, a node crash could be one of the reasons.
It's just other stuff as well, like the original holder being a temp one that got restarted, or the original holder getting pushed out of range and hence no longer being part of the close range.
Combined with the other situations I mentioned previously, these can all cause the 8 copies to vanish.
And they all will be investigated.
A detected dead peer triggers replication among nodes, which will consume more memory during that period, and may also run into the data loss during transmission that was mentioned.
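A rough sketch of that trigger, with stand-in types rather than the real safenode/libp2p API, to show why a burst of dead-peer detections means a burst of memory and traffic:

```rust
const CLOSE_GROUP_SIZE: usize = 8;

type PeerId = u64;    // stand-in for a libp2p PeerId
type RecordKey = u64; // stand-in for a chunk/record address

trait Network {
    /// Was `peer` one of the CLOSE_GROUP_SIZE closest nodes to `key`?
    fn was_in_close_group(&self, peer: PeerId, key: RecordKey) -> bool;
    /// The current closest peers to `key`.
    fn closest_peers(&self, key: RecordKey, n: usize) -> Vec<PeerId>;
    /// One-to-one replication push (a single PUT per chunk, as noted below).
    fn replicate_to(&self, target: PeerId, key: RecordKey);
}

fn on_peer_dead(dead_peer: PeerId, my_records: &[RecordKey], network: &dyn Network) {
    for &key in my_records {
        // Only records the dead peer was responsible for need a fresh copy,
        // but each one costs memory and traffic while it is in flight.
        if network.was_in_close_group(dead_peer, key) {
            for target in network.closest_peers(key, CLOSE_GROUP_SIZE) {
                network.replicate_to(target, key);
            }
        }
    }
}
```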
I assume this is from the log you shared at ReplicationNet [June 7 Testnet 2023] [Offline] - #163 by Vort, which shows a total of 127 dead_peer detections during an 8-hour period.
Not sure if that's a reasonable number of dead_peers detected, but it does cause some oscillation in memory usage.
Regarding that particular node, I don't have the full log of it so far, so I can't tell what happened to it.
Maybe a normal temp-hosted node just got restarted, or maybe it hung like the other nodes reported as not having a growing log, or anything else.
Anyway, will keep an eye on it when possible.
1st, a full node still consumes memory/traffic when accepting chunk copies from store/replicate. It's just that it won't create a chunk file on disk.
2nd, the total_mb_read collected is still a bit mysterious to me; as Shu reported, there is only 4 KB read against hundreds of MB written. It may be explainable, I just don't get it yet.
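A minimal sketch of the 1st point, assuming a simple file-per-chunk store and the 1024-file cap (not the actual implementation): the chunk bytes are already in memory when the cap check happens, so the only thing a full node saves is the write to disk.

```rust
use std::fs;
use std::path::Path;

const MAX_CHUNK_FILES: usize = 1024;

fn handle_incoming_chunk(store_dir: &Path, addr_hex: &str, bytes: Vec<u8>) -> std::io::Result<()> {
    // The chunk has already been received and deserialised into `bytes`,
    // so the memory/traffic cost has been paid whether we are full or not.
    let existing = fs::read_dir(store_dir)?.count(); // naive count, fine for a sketch
    if existing >= MAX_CHUNK_FILES {
        // Full: accept the message but do not create a chunk file.
        return Ok(());
    }
    fs::write(store_dir.join(addr_hex), bytes)
}
```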
Hi @Shu, that's really a fab set of statistics/diagrams, done with in-depth observations.
Great work, and really appreciated. Thx a lot for the effort.
When storing a chunk, a broadcast is used among the CLOSE_GROUP_SIZE (8) nodes.
i.e. it's expected you will receive 8 PUT requests.
For sure we will filter out duplicated ones, hence only one chunk file will be created.
Meanwhile, replication is a one-to-one request, i.e. one PUT request for one chunk file to be created.
So, overall, the ratio of PUT requests vs. chunk files created will be a number lower than 8, and how close to 8 it is depends on how much replication is involved.
Could be more, but it should be a close number, as I understand it.
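As a small worked example of that ratio (the chunk counts below are illustrative only): stored chunks arrive as a broadcast of 8 PUTs that create one file, replicated chunks arrive as one PUT per file, so the more replication a node handles, the further its PUTs-per-file ratio drops below 8.

```rust
const CLOSE_GROUP_SIZE: usize = 8;

fn puts_per_chunk_file(stored_chunks: usize, replicated_chunks: usize) -> f64 {
    let puts = CLOSE_GROUP_SIZE * stored_chunks + replicated_chunks; // 8 per store, 1 per replicate
    let files = stored_chunks + replicated_chunks;                   // one file either way
    puts as f64 / files as f64
}

fn main() {
    // e.g. 970 chunks stored directly and 30 received via replication gives
    // (8*970 + 30) / 1000 ≈ 7.8, in line with the ratio reported later in the thread.
    println!("{:.1}", puts_per_chunk_file(970, 30));
}
```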
It has the same IP as mentioned in the 1st post, 165.232.106.150. I thought that no restarts were planned for those nodes.
Looks fine for my node: [2023-06-11T12:03:48.236602Z TRACE sn_logging::metrics] {"physical_cpu_threads":4,"system_cpu_usage_percent":44.088863,"system_total_memory_mb":8461.066,"system_memory_used_mb":5606.904,"system_memory_usage_percent":66.267105,"network":null,"process":{"cpu_usage_percent":0.93167704,"memory_used_mb":155.50874,"bytes_read":0,"bytes_written":9564,"total_mb_read":405.95673,"total_mb_written":946.9618}}
So I think that it is more likely that something went wrong with Shu's node than that the total_mb_read calculation is broken.
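For anyone wanting to compare these figures across nodes, here is a small sketch (assuming the serde_json crate) that pulls total_mb_read and total_mb_written out of a sn_logging::metrics line like the one above:

```rust
use serde_json::Value;

/// Returns (total_mb_read, total_mb_written) from one metrics TRACE line, if present.
fn parse_metrics_line(line: &str) -> Option<(f64, f64)> {
    // The JSON payload starts after the "] " that closes the log prefix.
    let json = line.splitn(2, "] ").nth(1)?;
    let v: Value = serde_json::from_str(json).ok()?;
    let process = v.get("process")?;
    Some((
        process.get("total_mb_read")?.as_f64()?,
        process.get("total_mb_written")?.as_f64()?,
    ))
}
```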
Np, it was a learning exercise digging into the safenode logs, specific container stats aggregation, the back-end time series DB, and the various other tools required to do all the plumbing to have the data ingested in real time from the true source, i.e. the safenode logs, all the way up to the frontend dashboard.
Outside of that, I was very interested to see what insights the data would reveal, and whether it would help your team and the community further confirm, identify bottlenecks in, and validate the design decisions made thus far.
Below are some further questions and observations, open to you, the Maidsafe team, as well as the community here, for further discussion and commentary if desired:
Did the distribution of message type counts (%) across the categories specified in the images here surprise anyone in terms of how a single node behaved in steady state over a contiguous period?
I hope my node was not an outlier in terms of the pattern observed, as I would expect many individual nodes to behave in a similar pattern, so digging deep over the timeline of a single node or so seemed like a good idea.
On further review, for what it's worth, on a per-hour basis on the charts above, whenever there was 1+ dead peer detected message, there was also at least 1+ Outbound Connection Error in the same 1-hour time window, but not vice versa for other time ranges.
I don't know much about the inner workings of libp2p etc., and maybe it's something with my node, but I am still surprised by the level of connection closed and connection connected messages here, as a majority of the peers, 2000 (provided by Maidsafe) out of ~2150+, likely just continued to stay up and available in the first 3-4 days without issues. But this might all be okay (short vs long-lived persistent connections between peer nodes and their close groups).
I guess: is the above in line with the current design and expectations, or do folks find it interesting as a statistic as is, but nothing of obvious concern at this stage?
The ratio is indeed around 7.8x for PUT requests to Chunk Writes for my node, so it falls in line with your explanation above. Thanks!
Just saw that David mentioned earlier in this topic that some of the throughput has to do with certain timeouts that shouldn't exist, and that it's a work in progress there, great!
Did this surprise anyone, especially in periods when the node itself was not asked to store a PUT request, yet logs showed peer connections closed and connected numerous times in steady state against a set of unique peerIDs, which happened to be in the 100s+ per peerID in the first 24 hours?
If it's in line with expectations, all good from my end, though I may want to dig deeper into the reasons behind the connections closed inner messages here to help me better understand the why, hmm.
How would one have thought this would play out for both the connections closed and connections connected distributions on the histogram, given 90% of the nodes were up in the first 3 days without disk space concerns or the 1024-chunk limit being hit yet (assuming here that Maidsafe's nodes were healthy)?
Left-skewed, right-skewed, or a normal distribution? And why?
I am not sure what payload this is carrying, but it seems reasonable that if load is to be distributed across the network, incoming vs outgoing for a certain type of data should be a near 50/50 split, otherwise it's a one-way flood storm?
Seeing that the ratio is nearly 1:1, is this in-line with the current design and expectations?
Always nice to see numbers add up or equal the expected outcome here. No further comments here.
I was curious here whether any single node ends up actually seeing most, if not all, of the network's peerID addresses via the address discovery phase etc. over time. You mentioned a cache regarding the routing tables etc., and it being uncapped currently; is a node expected to discover nearly 90%+ of the peerIDs on the network?
I assume churn was very low in the first 24 to 48 hrs since 90% of the nodes continued to stay up and healthy etc.
Overall, I am extremely delighted with the outcome of this testnet in terms of the stability, just like many others!
I thank you for your time in providing the explanations earlier, and for helping me better understand the Safe Network architecture!
And if I raised too many questions in one go, I apologize, as I know your team is busy and it's the weekend too.
Although I really wanted to, I did not manage to take part in the network testing (but everything is ahead of me ;))
However, I have reviewed the entire thread and the statements of the testers show that the network is making great progress!
Big congratulations and thanks to everyone: the MS team, the testers and the supporters!!
Yes I have snapshotted my server and closed down my 30 nodes.
I'll wait for some kind of summary from the team, but I think this has been generally a huge success.
Let's see what a couple of days poring over logs brings - then it's on to the next testnet.
Someone said DBCNet - dunno if we are 100% ready for that just now but I will be delighted to be proved wrong.
Thanks to all who made this possible and all who participated.