Calculating the probability of data persistence

I started to wonder how the mathematics works when a bunch of nodes go offline. What is the chance that a file loses a chunk? How do the size of a file, the size of the network and the replication count affect the probability of survival?

I found an old thread about this, but I’m not sure how relevant that is now that the design has changed so much.

From that thread I found a calculator by @to7m that can be used to calculate the survival rate of a file in different scenarios. If I understood it correctly, I got the following results, for example:

There is a 99.9% chance of no chunk loss
When 5% of the network goes offline in one instant
When replication count is 5
And chunk size is 0.5MB
For a 1.6GB file.

OR

There is a 90% chance of no chunk loss
When 10% of the network goes offline in one instant
When replication count is 5
And chunk size is 0.5MB
For a 5.26GB file.
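
If I read the calculator right, these numbers come from a simple model: a chunk is lost only when every one of its copies goes offline in the same instant, so the whole file survives with probability (1 - outage^copies)^chunks. A minimal Python sketch of that assumption (my reading, not the calculator's actual code):

```python
def file_survival_probability(file_bytes, chunk_bytes, outage_fraction, copies):
    """Chance that no chunk loses all of its copies in one instantaneous outage.
    Assumes copies fail independently; a chunk is lost only if all copies vanish."""
    chunks = max(1, file_bytes // chunk_bytes)
    p_chunk_lost = outage_fraction ** copies
    return (1 - p_chunk_lost) ** chunks

# 1.6 GB file, 0.5 MB chunks, 5 copies, 5% outage  -> ~0.999
print(file_survival_probability(1_600_000_000, 500_000, 0.05, 5))

# 5.26 GB file, 0.5 MB chunks, 5 copies, 10% outage -> ~0.90
print(file_survival_probability(5_260_000_000, 500_000, 0.10, 5))
```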

Are the above numbers and the calculations behind them correct?
@dirvine, was it so that missing even one chunk renders the file useless in all cases? Or could it be that it would just be a glitch in a video, for example?

How do these numbers compare to other storage methods? It seems to me that quite a small number of nodes going offline can cause problems for bigger files.

8 Likes

I believe so, because of how self-encryption works - if I recall correctly. Self-encryption gives us deduplication though, so there is a trade-off.

I wonder how important deduplication is now (in the grand scheme of things) because of AI … which seemingly has an insane ability to compress knowledge.

2 Likes

This then leads to a situation where a bigger number of chunks makes the file more vulnerable. I don't know if the chunk size could vary based on the file size, to lessen the odds of the file becoming unusable. Maybe some chunks could be 10 or even 100 MB? (See the sketch below the table.)

EDIT: Copied from the calculator table, sorry for the formatting:

prognosis of file: 0.999
sudden network outage: 0.01
copies per chunk: 8
chunk size (bytes): 10000000 (10 MB)
prognosis per chunk: 1
file size for 0.999 prognosis: 90.11 EB
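
To show why bigger chunks (and more copies) would help under that same assumed model, here is a quick comparison for a hypothetical 160 GB file at a 5% outage - a sketch only, the chunk sizes are just examples:

```python
# Fixed 160 GB file, 5% instantaneous outage; vary copy count and chunk size.
# Fewer, larger chunks mean fewer "chances" for the file to lose one of them.
for copies in (5, 8):
    for chunk_mb in (0.5, 10, 100):
        chunks = 160_000_000_000 // int(chunk_mb * 1_000_000)
        survival = (1 - 0.05 ** copies) ** chunks
        print(f"{copies} copies, {chunk_mb:>5} MB chunks -> survival {survival:.7f}")
```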

3 Likes

No, that would mess with payments at the least.

Verification is an important final step to ensure that it's all uploaded correctly. But as you say, the odds will get worse with larger files.

I wonder how important self-encryption is now though, with AI. Is de-duplication all we gain from SE? Or are there other benefits?

2 Likes

But it varies already, doesn't it? Now even very small files get divided into three chunks.

1 Like

I don't recall what that recent change was - it didn't use padding? I remember asking about it, though - will have to dig up the response. Even so, that would only mean small files would have smaller chunks and so presumably end up paying more. If larger files have larger chunks … how do you determine how big they are and how much to pay, and then what about node size - maybe it can't be fixed … lots of problems.

1 Like

Yes, and there was recently a discussion where 0.5MB was deemed somehow ideal. I don't remember why exactly, but I'm sure data persistence was not taken into account. In that sense, raising the maximum chunk size to 1MB or 2MB would make a difference.

I tried to find a good discussion about all this, based on mathematical facts, but was not able to find one. I'm not sure if this has really been discussed, but I think now is the time, and I would like to see some folks more capable than me sharing their views. Paging @dirvine, @to7m, @mav.

3 Likes

Personally, I feel like it has been discussed a lot. Maybe not all in one place, but the ideas have been going around for some time.

2 Likes

I feel the same, that it has been discussed a lot, but without links to clear calculations.

1 Like

BTW, here’s an old video on self-encryption - had to refresh my brain:

Also, I remember the discussion being more about how to cope with large outages. I'm also interested in how even quite small outages can cause problems for big files.

I don't know what the industry standards are - what is deemed reliable?

For example, the following calculation shows that even a 5% outage causes the survival rate to be lower than 0.99999 for files bigger than 160MB. But is that good or not? It would be good to get some graphs from folks with more knowledge.

File size for given prognosis
prognosis of file: 0.99999
sudden network outage: 0.05
copies per chunk: 5
chunk size (bytes): 5000000
prognosis per chunk: 0.9999996875
file size for 0.99999 prognosis: 160 MB
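
If the model above is right, the "file size for given prognosis" row is just the same formula inverted: chunks = ln(prognosis) / ln(1 - outage^copies), times the chunk size. A rough check in Python (my assumption about how the calculator works, not its actual code):

```python
import math

def file_size_for_prognosis(prognosis, outage_fraction, copies, chunk_bytes):
    """File size (bytes) at which whole-file survival drops to `prognosis`,
    under the same instantaneous-outage, independent-copies assumption."""
    p_chunk_lost = outage_fraction ** copies
    chunks = math.log(prognosis) / math.log1p(-p_chunk_lost)
    return chunks * chunk_bytes

# 0.99999 prognosis, 5% outage, 5 copies, 5 MB chunks -> ~160 MB
print(file_size_for_prognosis(0.99999, 0.05, 5, 5_000_000) / 1e6, "MB")
```
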
2 Likes

The probability stuff is outside my skill zone. I’ll tag @neo as he might have some insights.

2 Likes

To give another example with a file that is about the size of those used in some LLMs:

File size for given prognosis
prognosis of file: 0.99
sudden network outage: 0.05
copies per chunk: 5
chunk size (bytes): 5000000
prognosis per chunk: 0.9999996875
file size for 0.99 prognosis: 160.8 GB

It seems that even a 5% outage would give a 160GB file a 1% chance of losing a chunk. That's quite bad in my opinion, but I am not really able to put those numbers in a meaningful context.
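
One way to put that 1% in context: with these numbers the expected number of lost chunks is only about 0.01, and for small per-chunk loss probabilities the chance of losing at least one chunk is roughly that expected count. A quick check, assuming the same model as above:

```python
chunks = 160_800_000_000 // 5_000_000   # ~32,160 chunks of 5 MB
p_chunk_lost = 0.05 ** 5                # 3.125e-7: all 5 copies gone at once
expected_lost = chunks * p_chunk_lost
print(expected_lost)                    # ~0.01, matching the ~1% chance of any loss
```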

4 Likes

I think it's not as bad as that. If we constrain loss to be instantaneous (nearly the worst outcome) and permanent (the worst outcome), then some of the figures look bad.

However, the chunks are replicated over at least 5 replicas in differing geographies. So a 5% outage in a geography would not fit these assumptions above.

Then

  • Caching
  • Archive nodes etc.

We should make loss close to a zero probability; it can never be zero though, no matter what.

Chunks are information theoretically secure, i.e. even quantum AI could not crack them.

11 Likes

Yep, but not what I meant. I was implying that deduplication saves space, but with AI and LLMs we have a new way to compress information that perhaps beats deduplication, so self-encryption (which as I understand it enables dedup) isn't needed in particular and maybe opens the door to something more forgiving in terms of data loss. But self-encryption & dedup have long been a promise of SAFE, so I doubt they're going away.

2 Likes

How is that done? I thought the location of the chunk is random.

1 Like

The key is, so is the placement of the nodes. They are equally random.

Yes, this is where I am with it all. However, until there is a perfect encoder (encoding data → vector database or similar), it's arguable the raw data needs to be available.

I believe we will get to that point at some stage, but I don't see many working on it. Although we can merge weights/parameters of differing models now, and with some help we could merge the weights of all models regardless of architecture.

The issue then will be quantization levels and underlying algorithms, i.e. recently a new RNN outperformed an attention-based NN.

2 Likes

@Toivo this is an important aspect that is left out of your probability assessment, I think. It means not 5 copies, but 6 (with archive nodes) or 7+ with caching, depending on how heavily cached. What happens if you change the copy number to 6 in the calculation?
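
For illustration, plugging 6 or 7 copies into the same simple model used earlier in the thread (a sketch, not the calculator itself):

```python
# 160.8 GB file, 5 MB chunks, 5% instantaneous outage; vary the copy count
chunks = 160_800_000_000 // 5_000_000
for copies in (5, 6, 7):
    print(copies, "copies ->", (1 - 0.05 ** copies) ** chunks)
# roughly: 5 -> 0.990, 6 -> 0.9995, 7 -> 0.999975
```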

1 Like

That doesn't lead to geographical distribution of every chunk. The bigger the file, the more probable it is that every replica of one of its chunks is, for example, in Finland.

1 Like

It means the probability of all copies of a chunk existing in one geography reduces as the network grows. The point I was making is that a 5% outage does not mean 5% of a file is gone.

If we get the math model right here, set parameters such as loss == perpetual loss and so on, then it will help.

So, as you point out, the probability that X can happen can be very low, but trying the same thing many more times increases the probability (your large file example). It may be easier to focus just on the loss of a single chunk, regardless of file size.
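
For a single chunk, under the same instantaneous-outage assumption, the loss probability is simply outage^copies - for example:

```python
# Chance that one specific chunk loses every copy in the same instant
for outage in (0.01, 0.05, 0.10):
    for copies in (5, 6, 8):
        print(f"outage {outage:.0%}, {copies} copies -> {outage ** copies:.3g}")
```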

3 Likes