Calculating the probability of data persistence

Toivo · February 2, 2024, 12:17pm

You are right!

I seem to be too, but I still need to retract my statement:

to7m · February 4, 2024, 10:44pm

It isn’t relevant where the encryption happens.

to7m · February 4, 2024, 11:12pm

@dirvine, as the node addresses are random, your initial claim about geographical distribution is misleading. As far as I can see, geographical distribution is not relevant to what is being discussed here. The same maths applies to an event that wipes nodes in a geographically-specific way (like a meteor) as to an event that wipes nodes in a geographically-neutral way (like malware), as 5% of all nodes is 5% of all nodes regardless of where they are hosted.

@toivo, thanks for paging me, glad you used the calculator! You seem to be understanding it correctly. It is indeed specifically for events where all chunks on given nodes are simultaneously destroyed (like for meteors and coordinated malware), rather than events that cause nodes to temporarily shut down without wiping data.

My hope was to limit the impact of inevitable data loss. I’m still convinced that self-encryption over whole files is a bad idea.

If we keep the model of an entire file depending on every single one of its chunks, then we’ll need a certain number of copies for each chunk to keep everyone happy. The exact number isn’t relevant to this argument. If we decreased that number by 1, and switched to a format that allowed partial file reconstruction, then that would be a much better outcome as far as I can see. Fewer files would be fully-reconstructable, but overall chunk loss would be significantly lower, and storage would be cheaper.

neo · February 4, 2024, 11:30pm

This also includes such things as (in no particular order)

Zero-day malware taking out a large number of machines
Operating system faults (say, a bug in a Windows update)
Windows update increasing the number of machines restarting at around the same time (i.e. before replication is given enough time to significantly complete)
People who shut off their machines around bedtime. Similar to, but maybe worse than, the effect of Windows updates
- i.e. in each timezone, a lot of PCs turn off during a 2-hour period.
ISP with a border router issue.
ISP with an event taking down its network.
- In AU our 2nd largest ISP went completely down for 8 - 12 hours. No internet for over 1/3rd of Aussies. If nodes could restart and recover their temp encryption key, then the lost chunks would return.
- Most people’s first troubleshooting step is “Did you turn off and turn back on all your equipment?”
and so on

Now nodes encrypt the chunks when stored to disk and decrypt them when retrieving. This is to prevent any other app on the computer reading the chunks in case there are chunks that did not get self encrypted.

This temp key at this time is only kept in memory and lost as soon as the node stops, thus all the chunks on disk are useless.

This makes simultaneous dropout even more of a danger, and it is why I believe the temp key must be able to be stored on disk in a safe manner that is not immediately readable. See my later replies to David.

dirvine · February 4, 2024, 11:34pm

I am not sure what you mean here. If nodes were distributed locally, then 5% could take out every copy of every chunk in that locality.

I think perhaps the point is this.

Nodes distributed globally and randomly aids in the safety of replica copies from local network disturbances. This is because the other replica copies are not in that locality and work away fine, creating new replicas when they see the local disturbance. They will see the disturbance as they are logically close, but not be negatively affected as they are geographically apart.

neo · February 4, 2024, 11:43pm

The issue wasn’t about placing nodes locally in one geo location; it is that there is no mechanism (rightly so) to make sure close nodes are NOT geo-local.

But the issue is that while it’s a tiny 0.000076 chance of 5 nodes being in a 15% section of the network (whatever grouping you define), the problem blows out once the file is large. For 3-chunk files there is a 1 in 13,000 chance that it will be affected on a 15% network loss. But for a 13,000-chunk file it is considered a statistical certainty that the file will have one chunk lost.
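To put rough numbers on this, here is a minimal sketch. It assumes each of a chunk's 5 replicas lands independently in the affected 15% of the network, and that a file counts as lost as soon as any one of its chunks loses every replica. The function names are made up for illustration, and real placement will not be perfectly independent.

```python
def chunk_loss_probability(fraction_lost: float, replicas: int) -> float:
    """Chance that every replica of one chunk sits in the lost part of the network."""
    return fraction_lost ** replicas


def file_loss_probability(fraction_lost: float, replicas: int, chunks: int) -> float:
    """Chance that at least one chunk of a file loses all of its replicas."""
    p_chunk = chunk_loss_probability(fraction_lost, replicas)
    return 1.0 - (1.0 - p_chunk) ** chunks


if __name__ == "__main__":
    # Assumed inputs taken from the post: 15% of the network lost, 5 replicas.
    for chunks in (1, 3, 13_000):
        p = file_loss_probability(0.15, 5, chunks)
        print(f"{chunks:>6} chunks: {p:.6f}")
```

Under these assumptions the per-chunk figure is about 0.000076 (roughly 1 in 13,000), a 3-chunk file comes out nearer 1 in 4,400, and a 13,000-chunk file around 63%. That is more likely than not rather than strictly certain, but the point that large files are far more exposed still holds.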

While this is smaller for, say, 5% of the network being lost, it’s made worse by the temp key being in memory, since even a sub-second power loss loses all those chunks. How much worse is small but still significant, since nodes in a large-area dropout (e.g. a city-wide power glitch) will not restart with their chunks intact.

I implore you to implement safe storage of the temp key & any required info so that the node can restart where it left off. This also makes archive nodes more easily restartable. Maybe by using weak encryption of the temp key and brute forcing it on restart. That would mean the chance of losing a large file is extremely small, requiring something like China shutting off its internet for a long time.

to7m · February 4, 2024, 11:46pm

A lot of those events aren’t truly simultaneous. The bedtime one for instance would just mean a node goes offline and then copies are made to compensate for that. And a lot of those events don’t include permanently wiping the chunks, so that’s not true data loss.

The temp key thing only seems relevant for very brief operations where there is data in RAM. Those operations could just be restarted if they fail, unless I’m missing something here.

dirvine · February 4, 2024, 11:51pm

This is where randomness solves that issue for us. Trying to control that leads to places where dragons live.

I don’t feel we have a good solution here. The yin and yang is this

no encryption
Hackers try and store bad data on nodes to poison the network

in ram key
On reboot you lose the data you held.

safe storage of the temp key
I am still not getting this, but if it’s a key on disk that is guessable then it’s the same problem as no encryption as folk will provide apps to read the data and we are back at square 1 again.

As I say I am not 100% on any track just yet, but it’s not my decision either. It’s worth chatting for sure. I was looking at encrypted volumes and such like, all messy and crappy. I have not found a simple solution here.

Another thing I was pondering was chunks so small that you could not really hide any bad images, and certainly not video, in them. Then perhaps we could use entropy checks for detecting text. But that only works for plain text, and many text formats have a lot of entropy. You could then check for file header information, but then again random data will sometimes contain valid-looking file header info too.
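As a rough illustration of what such an entropy check might look like, here is a minimal sketch that measures the Shannon entropy of a chunk's bytes. The function name and the sample inputs are hypothetical, and no threshold is proposed, since compressed or encoded text can score close to random data.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    plain = b"the quick brown fox jumps over the lazy dog " * 100
    random_like = os.urandom(4096)  # stands in for encrypted or compressed data
    print(f"plain text : {shannon_entropy(plain):.2f} bits/byte")
    print(f"random data: {shannon_entropy(random_like):.2f} bits/byte")
```

Plain English text scores well below 8 bits per byte while random-looking data sits close to 8, which is why this only separates plain text from everything else.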

So a few things to think of.

BTW I doubt archive nodes would delete all data on reboot.

neo · February 4, 2024, 11:53pm

It’s statistically significant though. I was saying the chance of a larger percentage of nodes going down simultaneously is higher.

It doesn’t require an exact simultaneous shutdown to be an effective simultaneous shutdown. Replication of a node’s chunks on shutdown of the node takes time. Maybe 10 to 20 minutes and if a chunk is only within that group then there is a fair chance of loss. Over a year this could represent a number of lost files.

But if the node was able to restart and either continue on, or replicate out its chunks, then wipe its storage and restart with a new ID, then the chunks have an excellent chance of never being lost.

No that isn’t the purpose of the temp key. It makes sure the data on disk is unreadable, so on restart the chunks it stored are permanently lost

to7m · February 4, 2024, 11:55pm

Nodes distributed globally and randomly aids in the safety of replica copies from local network disturbances.

If you have random distribution of chunks, then there is a chance for any given chunk that all of the copies will happen to be held in, say, London. If London is destroyed, the chunk is gone. The alternative of geographically-aware distribution would mean that chunks are never stored in the same place, and London being destroyed wouldn’t be a problem for the chunks.

to7m · February 5, 2024, 12:02am

Nodes can restart and resurrect any chunks they have (to the best of my knowledge), which is why I’m not concerned about a lot of your scenarios. The big ones to me are large-scale physical devastations (meteors, nuclear strikes) and coordinated malware attacks.

to7m · February 5, 2024, 12:03am

No that isn’t the purpose of the temp key. It makes sure the data on disk is unreadable, so on restart the chunks it stored are permanently lost

Ah okay. I haven’t come across this before. This sounds crazy so I’m hoping it’s not the case.

neo · February 5, 2024, 12:03am

So true

Not really. Safe storage of the key means that the chance of another app reading the chunks is still reduced to almost zero. Very much a positive. You get 99% of the protection of keeping the temp key only in RAM.

But for that 1% where a hacker tries to get it, then no, it is almost the same as the key only in RAM.

The hacker is assumed to have access to the PC either physically or through something like malware or other autonomous means.
The hacker scans memory and grabs the temp key. Having it only in RAM did not protect.
The hacker kills off the good nodes and runs a modified node software and sees all it can, no node-side encryption.

So in effect, the only thing storing the temp key safely on disk adds is the tiny chance that another app knows how to get it. But if you go the hacker route, then they can get the temp key or bypass it anyhow.

But if they use the temp key, for the same protection reasons the other nodes do, then yes they would. To leave archive nodes without encryption is to make them not only a target but a good target for anyone wanting to grab a large number of chunks that were not encrypted (e.g. via court orders).

This is why I am pushing for that temp key to be safely stored on disk as well to allow the node to restart and either continue or distribute those chunks before starting as a new node.

dirvine · February 5, 2024, 12:13am

If the key is on disk then the chunks can be considered not encrypted, surely?

In that case it’s all bets off, they can do anything.

dirvine · February 5, 2024, 12:15am

To be clear, what I mean here is:

Bad (possibly state) actor stores a lot of bad (what is bad etc. yada yada) data on the network.
Same actor releases an app that says, check your SAFE node for bad data, if you are storing it you are breaking the law and we will jail you etc. etc.

to7m · February 5, 2024, 12:19am

As I understand it, nodes do not encrypt the chunks. Chunks are encrypted before they reach the node. @dirvine is that correct?

edit: The answer was, in most cases, no this is not correct.

dirvine · February 5, 2024, 12:23am

They are via the client API, but there is a low level API on the network that allows you to store any chunk. So as long as the content hashes to the name (key) then it’s valid. There is no simple way to avoid that.
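A minimal sketch of that validity rule, to show why the node cannot tell encrypted content from plain content. SHA-256 and the function names are assumptions for illustration only; the thread does not say which hash the network uses.

```python
import hashlib


def chunk_name(content: bytes) -> bytes:
    """Derive a chunk's name from its content (SHA-256 is an assumption here)."""
    return hashlib.sha256(content).digest()


def is_valid_chunk(name: bytes, content: bytes) -> bool:
    """Under this rule a node accepts any content whose hash equals the name."""
    return chunk_name(content) == name


if __name__ == "__main__":
    plaintext = b"arbitrary bytes that never went through the client API"
    name = chunk_name(plaintext)
    print(is_valid_chunk(name, plaintext))    # content matches its name
    print(is_valid_chunk(name, b"tampered"))  # content does not match
```

The check only ties the name to the bytes. It says nothing about whether those bytes were self-encrypted first, which is the gap being described.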

to7m · February 5, 2024, 12:26am

Is that a yes?

dirvine · February 5, 2024, 12:29am

Currently we have the ability for nodes to create a temp encryption key in RAM and encrypt each chunk stored to disk. This is to prevent any app reading the on-disk data.

By default, and until recently, this did not happen.

However any client app using the client API has the data encrypted before it leaves the client.

You can bypass the client API and this is where the issue is.

neo · February 5, 2024, 12:32am

For most (over 99%) apps and situations, then yes, it can be considered encrypted.

Damn you just explained the same as me. In this case then it matters not that the key even exists.

They can have their app take the temp key from memory directly
They could run their own nodes without the temp key encryption
yada yada.

The temp key encryption only protects from dumb/lazy bad people and good people. Those types of bad people represent over 99% of bad people, who are just trying to read data off a disk and see what they can.

The bad people who could get at the key safely stored on disk can also just change the node software that is run to a modded one without the temp key at all.

Basically storing the temp key safely on disk does not practically change the security of those chunks. Anyone wanting the chunks will have a way to do it without worrying about the temp key on disk. Grab the in memory copy or mod the node app on the PC to not use it.

The suggestion made by another poster for safely storing the temp key is similar to what some malware does, and the result is not readable by virus checkers either.

temp key is still a very strong key with secure encryption being done on the chunks
encrypt the temp key (and any other Node ID info etc) with a weak encryption and store on disk.
on restart the node brute forces the temp key & node info.
- this is the method that the malware uses.
then the node either distributes the chunks then starts afresh with new ID or resumes operations as the old node.
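The steps above can be sketched roughly as follows. This is a toy under loud assumptions, not the node's real code: the 16-bit wrapping secret, the SHA-256 pad, the HMAC check tag and every name here are hypothetical, and only the Python standard library is used. A real design would need a proper cipher and a work factor tuned so recovery takes a meaningful amount of time.

```python
import hashlib
import hmac
import secrets

WEAK_BITS = 16  # hypothetical size of the small wrapping secret that gets discarded


def _pad(weak_secret: int, salt: bytes) -> bytes:
    """Derive a 32-byte pad from the small secret and a public salt."""
    return hashlib.sha256(salt + weak_secret.to_bytes(4, "big")).digest()


def wrap_temp_key(temp_key: bytes) -> tuple[bytes, bytes, bytes]:
    """Return (salt, wrapped_key, check_tag) to store on disk; the weak secret is not kept."""
    if len(temp_key) != 32:
        raise ValueError("this sketch assumes a 32-byte temp key")
    salt = secrets.token_bytes(16)
    weak_secret = secrets.randbelow(2 ** WEAK_BITS)
    wrapped = bytes(a ^ b for a, b in zip(temp_key, _pad(weak_secret, salt)))
    # The tag lets the restarting node recognise the right candidate key.
    tag = hmac.new(temp_key, salt, hashlib.sha256).digest()
    return salt, wrapped, tag


def recover_temp_key(salt: bytes, wrapped: bytes, tag: bytes) -> bytes:
    """Brute-force every candidate weak secret until the check tag matches."""
    for guess in range(2 ** WEAK_BITS):
        candidate = bytes(a ^ b for a, b in zip(wrapped, _pad(guess, salt)))
        expected = hmac.new(candidate, salt, hashlib.sha256).digest()
        if hmac.compare_digest(expected, tag):
            return candidate
    raise ValueError("no candidate key matched the check tag")


if __name__ == "__main__":
    temp_key = secrets.token_bytes(32)  # the strong key that encrypts chunks
    salt, wrapped, tag = wrap_temp_key(temp_key)
    print(recover_temp_key(salt, wrapped, tag) == temp_key)
```

Anyone who copies the stored values can run the same brute force, so this only raises the effort for casual readers of the disk, which matches the trade-off argued in this thread.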

Basically the temp key implementation was for an edge case where the data on disk was easily readable by other apps, or by attackers, if the client did not encrypt the chunk. While very important against accidental or curious-to-bad people, it was still an edge case to close that “loophole”. Against hackers or knowledgeable bad people it is ineffectual, since the key can be read from memory or bypassed.
