Calculating the probability of data persistence

I started to wonder how the mathematics works when a bunch of nodes go offline. What is the chance that a file loses a chunk? How do the size of a file, the size of the network and the replication count affect the probability of survival?

I found an old thread about this, but I’m not sure how relevant that is now that the design has changed so much.

From that thread I found a calculator by @to7m that can be used to calculate the survival rate of a file in different scenarios. If I understood it correctly, I got the following results, for example:

There is a 99.9% chance of no chunk loss
When 5% of the network goes offline in one instant
When replication count is 5
And chunk size is 0.5MB
For a 1.6GB file.

OR

There is a 90% chance of no chunk loss
When 10% of the network goes offline in one instant
When replication count is 5
And chunk size is 0.5MB
For a 5.26GB file.
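
If I read the calculator right, these numbers come from a simple model: a chunk is lost only when every one of its copies goes offline in the same instant, so the whole file survives with probability (1 - outage^copies)^chunks. A minimal Python sketch of that assumption (my reading, not the calculator's actual code):

```python
def file_survival_probability(file_bytes, chunk_bytes, outage_fraction, copies):
    """Chance that no chunk loses all of its copies in one instantaneous outage.
    Assumes copies fail independently; a chunk is lost only if all copies vanish."""
    chunks = max(1, file_bytes // chunk_bytes)
    p_chunk_lost = outage_fraction ** copies
    return (1 - p_chunk_lost) ** chunks

# 1.6 GB file, 0.5 MB chunks, 5 copies, 5% outage  -> ~0.999
print(file_survival_probability(1_600_000_000, 500_000, 0.05, 5))

# 5.26 GB file, 0.5 MB chunks, 5 copies, 10% outage -> ~0.90
print(file_survival_probability(5_260_000_000, 500_000, 0.10, 5))
```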

Are the above numbers and the calculations behind them correct?
@dirvine, was it so that missing even one chunk renders the file useless in all cases? Or could it be that it would just be a glitch in a video, for example?

How do these numbers compare to other storage methods? It seems to me that quite a small number of nodes going offline can cause problems for bigger files.

8 Likes

I believe so, because of how self-encryption works - if I recall correctly. Self-encryption gives us deduplication though, so there is a trade-off.

I wonder how important deduplication is now (in the grand scheme of things) because of AI … which seemingly has an insane ability to compress knowledge.

2 Likes

This then leads to a situation where a bigger number of chunks makes the file more vulnerable. I don't know if the chunk size could vary based on the file size, to lessen the odds of the file becoming unusable. Maybe some chunks could be 10 or even 100 MB? (See the sketch below the table.)

EDIT: Copied from the calculator table, sorry for the formatting:

prognosis of file: 0.999
sudden network outage: 0.01
copies per chunk: 8
chunk size (bytes): 10000000 (10 MB)
prognosis per chunk: 1
file size for 0.999 prognosis: 90.11 EB
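
To show why bigger chunks (and more copies) would help under that same assumed model, here is a quick comparison for a hypothetical 160 GB file at a 5% outage - a sketch only, the chunk sizes are just examples:

```python
# Fixed 160 GB file, 5% instantaneous outage; vary copy count and chunk size.
# Fewer, larger chunks mean fewer "chances" for the file to lose one of them.
for copies in (5, 8):
    for chunk_mb in (0.5, 10, 100):
        chunks = 160_000_000_000 // int(chunk_mb * 1_000_000)
        survival = (1 - 0.05 ** copies) ** chunks
        print(f"{copies} copies, {chunk_mb:>5} MB chunks -> survival {survival:.7f}")
```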

3 Likes

No, that would mess with payments at the least.

Verification is an important final step to ensure that it's all uploaded correctly. But as you say, the odds will get worse with larger files.

I wonder how important self-encryption is now though, with AI. Is de-duplication all we gain from SE? Or are there other benefits?

2 Likes

But it varies already, doesn't it? Now even very small files get divided into three chunks.

1 Like

I don't recall what that recent change was - it didn't use padding? I remember asking about it, though - will have to dig up the response. Even so, that would only mean small files would have smaller chunks and so presumably end up paying more. If larger files have larger chunks … how do you determine how big they are and how much to pay, and then what about node size - maybe it can't be fixed … lots of problems.

1 Like

Yes, and there was recently a discussion where 0.5MB was deemed somehow ideal. I don't remember why exactly, but I'm sure data persistence was not taken into account. In that sense, raising the maximum chunk size to 1MB or 2MB would make a difference.

I tried to find a good discussion about all this, based on mathematical facts, but was not able to find one. I'm not sure if this has really been discussed, but I think now is the time, and I would like to see some folks more capable than me sharing their views. Paging @dirvine, @to7m, @mav.

3 Likes

Personally, I feel like it has been discussed a lot. Maybe not all in one place, but the ideas have been going around for some time.

2 Likes

I feel the same, that it has been discussed a lot, but without links to clear calculations.

1 Like

BTW, here’s an old video on self-encryption - had to refresh my brain:

Also, I remember the discussion being more about how to cope with large outages. I'm also interested in how even quite small outages can cause problems for big files.

I don't know what the industry standards are - what is deemed reliable?

For example, the following calculation shows that even a 5% outage causes the survival rate to be lower than 0.99999 for files bigger than 160MB. But is that good or not? It would be good to get some graphs from folks with more knowledge.

File size for given prognosis
prognosis of file: 0.99999
sudden network outage: 0.05
copies per chunk: 5
chunk size (bytes): 5000000
prognosis per chunk: 0.9999996875
file size for 0.99999 prognosis: 160 MB
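
If the model above is right, the "file size for given prognosis" row is just the same formula inverted: chunks = ln(prognosis) / ln(1 - outage^copies), times the chunk size. A rough check in Python (my assumption about how the calculator works, not its actual code):

```python
import math

def file_size_for_prognosis(prognosis, outage_fraction, copies, chunk_bytes):
    """File size (bytes) at which whole-file survival drops to `prognosis`,
    under the same instantaneous-outage, independent-copies assumption."""
    p_chunk_lost = outage_fraction ** copies
    chunks = math.log(prognosis) / math.log1p(-p_chunk_lost)
    return chunks * chunk_bytes

# 0.99999 prognosis, 5% outage, 5 copies, 5 MB chunks -> ~160 MB
print(file_size_for_prognosis(0.99999, 0.05, 5, 5_000_000) / 1e6, "MB")
```
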
2 Likes

The probability stuff is outside my skill zone. I’ll tag @neo as he might have some insights.

2 Likes

To give another example with a file that is about the size of those used in some LLMs:

File size for given prognosis
prognosis of file: 0.99
sudden network outage: 0.05
copies per chunk: 5
chunk size (bytes): 5000000
prognosis per chunk: 0.9999996875
file size for 0.99 prognosis: 160.8 GB

It seems that even a 5% outage would give a 160GB file a 1% chance of losing a chunk. That's quite bad in my opinion, but I am not really able to put those numbers in a meaningful context.
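
One way to put that 1% in context: with these numbers the expected number of lost chunks is only about 0.01, and for small per-chunk loss probabilities the chance of losing at least one chunk is roughly that expected count. A quick check, assuming the same model as above:

```python
chunks = 160_800_000_000 // 5_000_000   # ~32,160 chunks of 5 MB
p_chunk_lost = 0.05 ** 5                # 3.125e-7: all 5 copies gone at once
expected_lost = chunks * p_chunk_lost
print(expected_lost)                    # ~0.01, matching the ~1% chance of any loss
```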

4 Likes

I think it's not as bad as that. If we constrain loss to be instantaneous (nearly the worst outcome) and permanent (the worst outcome), then some of the figures look bad.

However, the chunks are replicated over at least 5 replicas in differing geographies. So a 5% outage in a geography would not fit these assumptions above.

Then

  • Caching
  • Archive nodes etc.

We should make loss close to a zero probability; it can never be zero though, no matter what.

Chunks are information theoretically secure, i.e. even quantum AI could not crack them.

11 Likes

Yep, but not what I meant. I was implying that deduplication saves space, but with AI and LLMs we have a new way to compress information that perhaps beats deduplication, so self-encryption (which as I understand it enables dedup) isn't needed in particular and maybe opens the door to something more forgiving in terms of data loss. But self-encryption & dedup have long been a promise of SAFE, so I doubt they're going away.

2 Likes

How is that done? I thought the location of the chunk is random.

1 Like

The key is, so is the placement of the nodes. They are equally random.

Yes, this is where I am with it all. However, until there is a perfect encoder (encoding data → vector database or similar), it's arguable the raw data needs to be available.

I believe we will get to that point at some stage, but I don't see many working on it. Although we can merge weights/parameters of differing models now, and with some help we could merge the weights of all models regardless of architecture.

The issue then will be quantization levels and underlying algorithms, i.e. recently a new RNN outperformed an attention-based NN.

2 Likes

@Toivo this is an important aspect that is left out of your probability assessment, I think. It means not 5 copies, but 6 (with archive nodes) or 7+ with caching, depending on how heavily cached. What happens if you change the copy number to 6 in the calculation?
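
For illustration, plugging 6 or 7 copies into the same simple model used earlier in the thread (a sketch, not the calculator itself):

```python
# 160.8 GB file, 5 MB chunks, 5% instantaneous outage; vary the copy count
chunks = 160_800_000_000 // 5_000_000
for copies in (5, 6, 7):
    print(copies, "copies ->", (1 - 0.05 ** copies) ** chunks)
# roughly: 5 -> 0.990, 6 -> 0.9995, 7 -> 0.999975
```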

1 Like

That doesn't lead to geographical distribution of every chunk. The bigger the file, the more probable it is that every replica of one of its chunks is, for example, in Finland.

1 Like

It means the probability of all copies of a chunk existing in one geography reduces as the network grows. The point I was making is that a 5% outage does not mean 5% of a file is gone.

If we get the math model right here, set parameters such as loss == perpetual loss and so on, then it will help.

So, as you point out, the probability that X can happen can be very low, but trying the same thing many more times increases the probability (your large file example). It may be easier to focus just on the loss of a single chunk, regardless of file size.
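
For a single chunk, under the same instantaneous-outage assumption, the loss probability is simply outage^copies - for example:

```python
# Chance that one specific chunk loses every copy in the same instant
for outage in (0.01, 0.05, 0.10):
    for copies in (5, 6, 8):
        print(f"outage {outage:.0%}, {copies} copies -> {outage ** copies:.3g}")
```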

3 Likes