Calculating the probability of data persistence

Is it so that losing one chunk makes the file unusable?

1 Like

Yes, a single chunk completely off the network would render a file useless.

You can apply Reed-Solomon or similar erasure coding to chunks to make them able to resist this. So say any 70% of the parts can make up the file, or even 30%, but it's actually equivalent to just having more replicas.

i.e. if the loss tolerance is 70%, then losing 70% plus one part kills it.
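A rough sketch of that comparison (not the network's actual scheme): with plain replication a chunk dies only when all its copies are lost; with a k-of-n erasure code the data dies once more than n-k parts are lost, i.e. the tolerance plus one. The loss probability 0.2 below is just an illustrative number.

```python
from math import comb

def chunk_survival(p_loss: float, replicas: int) -> float:
    # Plain replication: the chunk survives unless every copy is lost.
    return 1 - p_loss**replicas

def erasure_survival(p_loss: float, k: int, n: int) -> float:
    # k-of-n erasure code: survives while at most n - k parts are lost.
    return sum(comb(n, i) * p_loss**i * (1 - p_loss) ** (n - i) for i in range(n - k + 1))

# Hypothetical numbers: each part/copy independently lost with probability 0.2.
print(chunk_survival(0.2, replicas=5))
print(erasure_survival(0.2, k=3, n=10))  # any 3 of 10 parts reconstruct the data
```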

4 Likes

Some calculations from the tool; file and chunk size are irrelevant here. These are calculated for a 20% outage, to demonstrate the effect of replication count as per @TylerAbeoJordan's request:

Summary

Tool mode: "File size for given prognosis"; sudden network outage: 0.2; prognosis of file and chunk size (bytes) not set.

| copies per chunk | prognosis per chunk |
| --- | --- |
| 1 | 0.8 |
| 2 | 0.96 |
| 5 | 0.99968 |
| 8 | 0.99999744 |
| 10 | 0.9999998976 |

So the bigger the replication count, the better the chances of survival. But the chance is never exactly 1, and with bigger files the chance of missing one of the thousands of chunks increases.

So we could think about how the network should be designed in order to compare favourably to other storage systems. What is the likelihood that some cloud storage provider destroys a 200GB file, for example? What is an acceptable likelihood for that to happen in Safe Network? What should the replication count and chunk size be then?
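For reference, the numbers in the table follow from a simple independence assumption (a sketch, not the tool's actual code): if a fraction p of nodes drops out at once and each chunk has r independently placed copies, the chunk survives with probability 1 - p^r, and a file of n chunks survives only if every chunk does. The 1 MiB chunk size and the resulting 200,000 chunks for 200 GB below are my own assumptions, purely for illustration.

```python
def chunk_prognosis(outage: float, copies: int) -> float:
    # Chance a chunk keeps at least one copy through a sudden outage.
    return 1 - outage**copies

def file_prognosis(outage: float, copies: int, chunks: int) -> float:
    # A file needs every one of its chunks to survive.
    return chunk_prognosis(outage, copies) ** chunks

for copies in (1, 2, 5, 8, 10):
    print(copies, chunk_prognosis(0.2, copies))   # matches the table above

# Hypothetical 200 GB file at an assumed 1 MiB chunk size ~= 200,000 chunks.
print(file_prognosis(0.2, 8, 200_000))
```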

6 Likes

How is this implemented? How can we ensure chunks are directed to a variety of geographic locations?

Edit: read to the end of the thread before asking questions that have already been addressed.

2 Likes

Another aspect of these calculations is that with the recent light nodes and node age gone, the thinking has shifted more in the direction that it doesn't matter if nodes come and go.

I think we should have some calculations of what degree and rate of coming and going is OK. It seems to me that even a quite low rate of churn might become problematic. But I'm far, far, far from competent to really calculate any of this.

EDIT:

It can also be thought of this way: every time the network shrinks by, for example, 1%, some chunks lose every one of their copies. How much of that can we tolerate?
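As a rough point of reference (assuming copies are placed independently and the shrink happens before any re-replication can kick in), a single chunk loses all of its copies in one sudden 1% shrink with probability 0.01 to the power of the replication count:

```python
# Per-chunk chance of losing every copy in one sudden 1% shrink,
# assuming independent placement and no re-replication in time (illustrative).
shrink, copies = 0.01, 5
print(shrink**copies)  # 1e-10 per chunk per shrink event
```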

3 Likes

It is a feature of the XOR address and the hash of the content. A node's pub key is random. Same with data. So you have a random piece of data going to a node in a random geography.

3 Likes

My understanding is that this will be automatic, which is great, but would it be fairly easy to allow people to opt to ‘seed’ specific files?

This would mean someone could choose to host an entire specific file on their node as well as the usual random chunks. Kind of like voluntary mini-archive nodes for specific full files.

My thinking is that if this were fairly easy, public files people highly value would have many replicas, ensuring ultra-secure availability, and this could possibly reduce network traffic, as someone could download a specific seeded file from a few seeders vs coordinating with many thousands of nodes.

Probably unnecessary if caching and archive nodes achieve all of this, but wondered if there might be any merit in this idea? Won’t be offended if there isn’t!

1 Like

I think the right way to think about data survival is something like this:

If the replication count is 5, then:

  1. How often does it happen that 5 or more nodes go offline at once?
  2. When that 5-or-more event is repeated, how does the effect of the repetitions accumulate?
  3. How big a portion of all nodes does that amount going offline represent?
  4. What is the probability of having one or more missing chunks in a file of any given size after a certain time?

Like:

  1. 5 or more nodes go simultaneously offline 100 times a day.
  2. In 5 years this happens 100 x 365 x 5 = 182 500 times.
  3. 5 or more represents X % of all nodes.
  4. I cannot calculate this. :thinking:

All the numbers except the "5 or more" are speculation only.

But the point is that all the shrinkage of the network that is bigger than the replication count does add up.

But is that adding up significant or not?
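A rough sketch of how points 1-4 could be combined (all inputs are hypothetical, and it ignores re-replication between events, so it is a worst case): each simultaneous-loss event wipes a given chunk with probability f^r, where f is the fraction of nodes lost in the event and r is the replication count, and those per-chunk losses compound over the number of events and over the chunks of a file.

```python
def file_survival(frac_lost: float, copies: int, events: int, chunks: int) -> float:
    """Worst-case sketch: probability a file still has all of its chunks after
    `events` simultaneous-loss events, each removing `frac_lost` of all nodes,
    with no re-replication in between (all inputs hypothetical)."""
    p_chunk_lost_per_event = frac_lost**copies
    p_chunk_survives = (1 - p_chunk_lost_per_event) ** events
    return p_chunk_survives ** chunks

# e.g. the post's 100 events/day for 5 years = 182,500 events, with each event
# hypothetically taking out 0.1% of nodes, 5 copies, and a 2,000-chunk file
print(file_survival(0.001, 5, 100 * 365 * 5, 2_000))
```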

3 Likes

Should question 1 be "How often do 5 or more nodes simultaneously go offline, never to return?"

Because, if even one node comes back online after a time, the chunk should be duplicated until it's at 5 copies again, if I understand things correctly. If so, I expect it'd be very rare for 5 nodes to go offline permanently close enough in time for replication not to kick in.

3 Likes

Yes.

And that depends on how the encryption of storage is handled - I mean the encryption that is meant to protect the node runner from knowing the contents. There was at least some talk that it might be so that even a brief closing of a node would make it lose its chunks. Note, not just going offline, but turning off / closing the node - or was it the whole computer, wiping keys from RAM?

Anyway, these calculations, done well, should inform how that encryption should be designed, for example.

1 Like

FYI: they are implementing the encryption in RAM at the moment, so the question is whether the node shuts down (rather than goes offline, which might mean it can come back online without a restart). If it shuts down, the chunks will be lost.

5 Likes

I made some calculations (on my phone, while doing something else) that made it look like a daily loss of… was it 3%? of nodes would leave a 2 GB file with about an 82% chance of surviving after 5 years. That's not good for a network meant to store data perpetually.

But the calculations were a drag to do, and they have a 72.3454% probability of containing severe errors. (3% daily loss is probably too much too.) And I'm not going to do them again, as I am 99.889% certain that there are guys here in this forum that can deliver 300% better results than me by just relaxing their sphincter.

I just want to draw attention to the possibility that small losses may be significant over time, and that it should be examined mathematically.

I believe that the previous serious calculations were done when the mindset was that the network is more stable and grows stably. Now it looks like the mindset is more like "it's growing, but can fluctuate while doing so". But can it really?
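For what it's worth, a number in that ballpark can come out of a sketch like this (not necessarily the same calculation as in the post; it assumes the 3% daily loss happens as one simultaneous event per day with no re-replication in between, 5 copies per chunk, and a hypothetical 0.5 MB chunk size giving about 4,000 chunks for 2 GB):

```python
daily_loss, copies, days = 0.03, 5, 365 * 5
chunks = 4_000  # hypothetical: 2 GB at an assumed 0.5 MB per chunk

p_chunk_killed_per_day = daily_loss**copies           # all copies inside the lost 3%
p_chunk_alive = (1 - p_chunk_killed_per_day) ** days  # compounded over 5 years
print(p_chunk_alive**chunks)                          # ~0.84 file survival
```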

2 Likes

You mean @neo will fart the answer in morse code?

1 Like

Are you sure? I did not think there was anything in the algo to examine IP addresses to ensure differing geographical locations.

I thought it was just choosing the closest node IDs to the chunk ID.

Thus the chance of all 5 copies being in the one geo location is the percentage of nodes in that geo location to the power of 5. Multiplying this by the number of chunks in a file gives you the expected number of chunks with all copies in any particular geo location with that percentage of nodes.

Thus for China it might be 15% of all nodes, and so the chance for any chunk to have all copies in China is 0.15^5 == 0.0076% (approx), and for a 10,000 chunk file that is 0.76 chunks on average. i.e. better than an even chance of losing the file.

For 7 copies of each chunk this becomes 0.15^7 * 10,000 == 0.017 chunks, or a 1.7% chance of a 10,000 chunk file being corrupted if a very large area shut down forever with 7 copies of each chunk.
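A quick check of those figures (a sketch; the 15% grouping and the 10,000 chunks are just the illustrative numbers from the post):

```python
share, chunks = 0.15, 10_000  # illustrative numbers from above

for copies in (5, 7):
    p_all_in_group = share**copies          # all copies land inside the grouping
    expected = p_all_in_group * chunks      # expected fully-contained chunks per file
    print(copies, p_all_in_group, expected)
# 5 copies: ~7.6e-05 per chunk, ~0.76 chunks expected
# 7 copies: ~1.7e-06 per chunk, ~0.017 chunks expected
```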

BUT if @dirvine / @maidsafe devs decide to use a method mentioned elsewhere of "encrypting the temp key with a simplistic random key", then the temp key can be recovered within seconds of a node restarting. This way the chance of permanent loss of chunks is reduced even further.

This could even be made a part of the client: say for each 10 chunks you have X extra chunks. "X" represents (not Twitter) the "safety level" for the file. The reason to keep the number of chunks in each group to something like 10 is that the time to regenerate a missing chunk multiplies as you increase the number of chunks in the group being "protected".

8 Likes

I think it will be. It makes the archive more chaotic and less deterministic, which is great in my book.

2 Likes

There isn’t. It’s like this:

  • The network is global, so it is in every geography
  • A new node from any geography can end up in any address range

This gets down to what we call a geographic area.

This assumes the outage was a total electrical wipeout or something as well.

This is very similar to increasing the replica count significantly AFAIK
However, if clients did this then I suppose it allows them to add even more replicas without having to touch the network code.

I am not sure I get this statement. Maybe I missed it earlier though?

4 Likes

In my understanding this is a simple add-on with a lot of benefits, including assisting data protection through to future archive nodes.

The poster was talking of malware which encrypts itself so that virus checkers will not know what it is. The virus checker will see it's encrypted and move on, since it does not know how long it could take to brute force the decryption.

So the malware on startup brute-forces the (weak/simple) encryption and goes to work.

Now the nodes use a temp key held in memory to secure the chunks and lose it when restarting.

If a weak encryption of the temp key, along with the node's required ID info, was done and stored on disk, then when restarting the node could brute-force the encrypted data in a few seconds. Obviously the method/key used has to be simple/weak.

This would help against power outages, or Windows forcing a restart after updating, etc.

It would also be a way archive nodes could go offline whenever needed and then resume later on, simplifying the new code needed to make an archive node.
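A minimal sketch of that idea, purely illustrative (this is not the actual node code; the "weak" key here is just a small random integer, kept small enough that a restarted node can brute-force it in seconds, so the real chunk key never sits on disk in the clear):

```python
import hashlib, os, secrets

WEAK_KEY_BITS = 20  # assumption: ~1M candidates, brute-forced in seconds

def _keystream(weak_key: int, length: int) -> bytes:
    # Derive a keystream by hashing the weak key with a counter (illustration only).
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(weak_key.to_bytes(4, "big") + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def seal_temp_key(temp_key: bytes) -> bytes:
    # Encrypt the in-memory temp key under a freshly chosen weak key;
    # prepend a checksum so the brute force knows when it has succeeded.
    weak_key = secrets.randbelow(1 << WEAK_KEY_BITS)
    ct = bytes(a ^ b for a, b in zip(temp_key, _keystream(weak_key, len(temp_key))))
    return hashlib.sha256(temp_key).digest()[:8] + ct

def recover_temp_key(blob: bytes) -> bytes:
    # On restart: try every possible weak key until the checksum matches.
    check, ct = blob[:8], blob[8:]
    for candidate in range(1 << WEAK_KEY_BITS):
        pt = bytes(a ^ b for a, b in zip(ct, _keystream(candidate, len(ct))))
        if hashlib.sha256(pt).digest()[:8] == check:
            return pt
    raise ValueError("temp key not recovered")

if __name__ == "__main__":
    temp_key = os.urandom(32)                   # the key protecting stored chunks
    blob = seal_temp_key(temp_key)              # what the node would write to disk
    assert recover_temp_key(blob) == temp_key   # recovered by brute force on restart
```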

That is what I thought and what the calculations assume as their basis. And it is a very small chance for all nodes storing all copies of a chunk to be in a particular grouping, which works out to a 0.0000759375 chance for a large portion (15%) of the network; and the example was 15% of the network no matter the definition of the grouping of the nodes.

The grouping could be geographical, operating system version, Starlink connections, one network's connections, and so on.

Yes, I agree that 15% is highly unlikely in real life and was used for illustration purposes.

Similar but not the same mathematically. It is a simple (future) addition to the client that is user configurable. The beauty is that dedup still works fine, since the 10 chunks are still the same and the "X" chunks are extra. The code is something that could be added by a community contributor (or MaidSafe) at any time in the future and adds to the security of the files held. And the "X" chunks can be added to existing files (with a new data map to hold the extra chunks) by anybody who wants to upload the extra "protection" chunks. PAR files use this method, where they add "parity" blocks to be able to reconstruct the original blocks. If the original blocks are valid then the extra "protection" blocks are not used.
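To get a feel for what the extra "X" chunks buy, here is a sketch (my own assumptions: chunk losses are independent, and q is whatever per-chunk loss probability the replication level already gives, e.g. the 0.15^5 from the earlier example):

```python
from math import comb

def group_recoverable(q: float, data: int = 10, extra: int = 0) -> float:
    # Probability a PAR-style group of `data` chunks plus `extra` parity chunks
    # can be rebuilt, when each chunk is independently lost with probability q.
    n = data + extra
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(extra + 1))

q = 0.15**5  # per-chunk loss from the 15%-outage, 5-replica example above
for x in (0, 1, 2, 4):
    print(x, group_recoverable(q, extra=x))
```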

6 Likes

In the testnets this ratio is relatively high. In a tiny network it will be quite small, and in a small to mature network it's extremely small. If each machine is running 20 or more nodes and there are 1 million machines, then the number is 1 in 4 million. When it's 100 million or more machines, it's 1 in 400 million or less.

But that is not the most efficient way to calculate or visualise the situation (in my opinion), since you need to know the number of times all those 5 nodes hold the only copies of any particular chunk. Your visualisation does not, and is not capable of, showing this. You still need that chance of all 5 holding the only copies of any particular chunk.

At this time each node uses a randomly generated key to encrypt all incoming chunks that it stores and decrypts them when retrieving them. That key is only held in memory and is lost once the node is restarted.

Well, not smelly. At least I hope not.

But it is rather simple probability. Looking at the other idea of 5 simultaneous losses, how many times a year that happens is almost impossible to work out, since we cannot know if it ever happens aside from a grouping dropping out at once. That of course makes it easier to just work out the group dropping out rather than 5 nodes.

4 Likes

I don’t think these calculations are correct.

If the file had 100 000 chunks, then the first calculation would give 7.6, which can’t be the case. Probability cannot exceed 1.

EDIT:

I bit the bullet and figured it out:

It goes like this: Probability for the above 15% example:

Correct. But:

For a 10,000 chunk file, the probability for it to corrupt because of a lost chunk is the chance of the file having AT LEAST ONE of its chunks with all copies in that 15% range. The probability of a single chunk NOT having all copies in the 15% range is 1 - 0.15^5 == 0.999924. So the probability of corruption is 1 - (the probability of none of its chunks having all copies in that range).

Which is 1 - 0.999924^10,000 == 0.53 == 53%.

So the probability of a 10,000 chunk file corrupting for the duration of a 15% outage is 53%.

And as an example: for a 5% outage (replication 5, 10,000 chunks) it is 0.003 == 0.3%.

OK, it seems I have learned to calculate these. :wink: But I'm not into learning how to make sensible graphs etc., which I think would be needed: a tool where you could change the parameters and see the expected results in different scenarios.
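Something like this small helper could be the start of such a tool (a sketch, using the same independence model as the calculations above; the parameters are the ones from these posts):

```python
def file_corruption_probability(group_share: float, copies: int, chunks: int) -> float:
    """Chance a file loses at least one chunk when a fraction `group_share`
    of nodes drops out at once, with `copies` replicas per chunk and
    independent placement (same model as the posts above)."""
    p_chunk_gone = group_share**copies
    return 1 - (1 - p_chunk_gone) ** chunks

print(file_corruption_probability(0.15, 5, 10_000))  # ~0.53
print(file_corruption_probability(0.05, 5, 10_000))  # ~0.003
```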

3 Likes

That isn't a probability, that is the number of chunks you would expect at a probability of 0.000076 for a 100,000 chunk file.

Just like you expect a one to appear 100 times when rolling a die 600 times.

I gave the probability of any chunk having all its copies in the 15%, and you gave the probability of it not being in the 15%.
Same thing expressed the other way around; both are valid.

3 Likes