I have some beginner questions with regard to the storage mechanics. I probably got parts of it wrong, so feel free to correct me.
(1) Let’s assume I have a 10 MB file that I want to store on the SAFE network.
The file is encrypted and chopped into 10 chunks of 1 MB each.
Each chunk is stored on 4 different vaults, which means my initial file is now distributed over 40 different vaults.
Whenever I send a GET request, the file is recomposed from the 10 chunks, depending on which vaults are online.
My question here is: what happens if the 4 vaults that hold the exact same chunk all go down? In that case I’d expect it to be impossible to get the file. Or is the chunk pushed to another vault when one vault goes down?
(2) If it’s true that the file cannot be retrieved while those vaults are down, does that mean that large files are exponentially more dependent on a stable network?
It is unlikely that all 4 will go offline at the exact same time (meaning, before the chunk is rebuilt to at least two replicas from one of the remaining copies).
Yes, but there may be a copy in local cache, or as soon as one of the nodes comes back, you will be able to get the missing chunk(s). If you need very high availability, keep a local copy (although some old members keep saying that’s not necessary - see old discussions about local copies and backups).
I don’t remember the 100 KB scenario; I think you get part of a chunk commingled with other people’s stuff and pay proportionally. But this is probably wrong.
If a single entity controls a fraction X of the storage, then the chance of a single chunk residing only on vaults that entity controls is X^4.
So if someone owns 35% of the storage, a single chunk has about a 1.5% chance of being stored only on their vaults. If that entity has 80% of the storage, the chance of a chunk being stored only on their vaults is about 41%.
That entity can then shut off all their nodes simultaneously.
Consider that the file is split into 1 MB chunks. Let N be the number of chunks and assume that losing one chunk destroys the file. Then the chance of losing the file is 1 - (1 - X^4)^N, assuming the entity is malicious.
Thus in this hypothetical, if you store a 100 MB file and one entity controls 35% of the network, the chance of losing the file is about 78%.
It’s been some time since I’ve taken combinatorics and I don’t know details of the implementation, so I might be wrong and this is definitely a simplified analysis.
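To make that arithmetic concrete, here is a minimal Rust sketch of the same simplified model (4 independently placed replicas, file lost if any chunk is lost - just the assumptions from the post above; the second call only shows how quickly the risk drops with a couple of extra live copies):

```rust
/// Simplified model from the post above: an entity controls fraction `x` of the
/// network's storage, each chunk has `replicas` copies placed independently, and
/// losing any single chunk loses the whole file.
fn chunk_loss_probability(x: f64, replicas: i32) -> f64 {
    // Chance that every copy of one particular chunk sits on the entity's vaults.
    x.powi(replicas)
}

fn file_loss_probability(x: f64, replicas: i32, chunks: i32) -> f64 {
    // Chance that at least one of the file's N chunks is lost: 1 - (1 - x^r)^N.
    1.0 - (1.0 - chunk_loss_probability(x, replicas)).powi(chunks)
}

fn main() {
    // 100 MB file => 100 chunks of 1 MB; attacker holds 35% of storage, 4 copies per chunk.
    println!("{:.1}%", 100.0 * file_loss_probability(0.35, 4, 100)); // ~78%
    // Same attacker, but 6 live copies per chunk:
    println!("{:.1}%", 100.0 * file_loss_probability(0.35, 6, 100)); // ~17%
}
```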
I think the calculation is correct, but the inputs are not strictly as taken - we usually refer to four copies of each chunk only for brevity. I think it is something more like 4 to 6 live copies, plus a number of offline copies. So four is really the minimum number of live copies at any one time, and rather than the file being lost if they all go down, it remains intact, just unavailable until the missing chunk comes back online.
So the calculation is, I think, correct, but the numbers are unduly pessimistic.
Also, as the network grows, it becomes increasingly improbable that a single entity could control a large enough proportion of it to mount this kind of attack. While the network is small, yes, it would potentially be feasible, but what the motive would be, and who would have the resources to mount the attack, are questions I find hard to answer. It does, though, emphasise the importance of network size and a decent ramp in terms of adoption.
That’s what I was asking about. So if I take my vault offline, someone (who? the other vaults hosting copies of the chunks the network is now lacking?) will identify the missing copies and assign a new vault where each chunk is stored.
Isn’t this a PUT request that is effectively unpaid? And what happens once I sign back into the network with my vault? Will there simply be another copy? Or will the network delete the chunk from my computer?
Yes, when you go offline the other vaults in the neighborhood notice and send messages to the folks who need to pick up the slack created by your vacancy… The new “neighboring addresses” then request the chunks they need from the other vaults that have the remaining copies of them.
There are no PUT fees attached to this. Farmers are paid based on GETs, not PUTs.
My understanding is that vaults are now non-persistent, and if you turn yours off you come up empty and with a new address next time. Non-persistent vaults - #6 by dirvine
The chunk will still be there on your computer, but unless the network is rebuilding from a massive outage it won’t acknowledge the “old” chunks in your vault. Your vault software will wipe your vault contents and start from scratch.
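Purely to illustrate the “pick up the slack” behaviour described above, here is a toy sketch; the names, types and fixed replica target are made up for illustration, not the actual vault or routing code:

```rust
use std::collections::{HashMap, HashSet};

/// Toy model: chunk name -> set of vault ids currently holding a live copy.
struct ReplicaTable {
    target_copies: usize,
    holders: HashMap<String, HashSet<u64>>,
}

impl ReplicaTable {
    /// Called when a vault drops off the network: remove it from every chunk's
    /// holder set, then report which chunks fell below the target and must be
    /// copied to a newly chosen vault by one of the remaining holders.
    fn vault_went_offline(&mut self, vault_id: u64) -> Vec<String> {
        let mut degraded = Vec::new();
        for (chunk, vaults) in self.holders.iter_mut() {
            if vaults.remove(&vault_id) && vaults.len() < self.target_copies {
                // No PUT payment is involved in this re-replication.
                degraded.push(chunk.clone());
            }
        }
        degraded
    }
}

fn main() {
    let mut table = ReplicaTable { target_copies: 4, holders: HashMap::new() };
    table.holders.insert("chunk-abc".to_string(), [1u64, 2, 3, 4].into_iter().collect());
    println!("{:?}", table.vault_went_offline(2)); // ["chunk-abc"] now needs a new holder
}
```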
Thanks @jreighley and @Seneca for clearing things up. Re: the 100 KB file - so this will be attached to another 900 KB package from someone else, as @janitor suggested?
I thought David Irvine said it was combined into the datamap (3 KB of metadata) or something like that, if it’s smaller than the chunk size. But IDK, it was a long time ago and things change.
While we’re on the topic of files, and how data could potentially be lost if all 4 copies went down, I want to extend on one thing that I haven’t yet found on this forum: file corruption.
Suppose vault 1 has a corrupted chunk and vault 2 goes down. Another vault takes its place and gets the data from vault 1, which is corrupted. Now you have 2 corrupted copies and 2 non-corrupted copies. If a vault with a non-corrupted copy goes down and its replacement takes a copy from 1 or 2, the corruption spreads. Once 3 copies are corrupted, the 4th will eventually be corrupted too.
I was thinking: what if we use something like btrfs “scrub”, which takes the metadata, compares it against each file and ensures its integrity has not been compromised? If it has, it replenishes by replacing the corrupted file with a clean, non-corrupted copy.
Things have changed, even on the chunk size issue. I’m not even sure the 4K minimum is valid anymore. I (seem to) remember there was talk of fixing the chunk size to 1 MB to remove some complexity, but I can’t remember where. Maybe it is referred to as 1 MB for simplicity of discussion, but I’m sure things have changed more recently. I guess when testing is done the chunk size (min/max) will be determined and fixed for the initial version of the network.
Of course, everyone is aware of this and it was discussed on several occasions before.
The idea is that once the network gets big, it would take a lot of resources to gain a big enough share.
If you have a RAID array that loses 35% of its disks, what happens? You lose data. So you may need a backup.
The name of a chunk, and therefore its position in XOR space, is based on the SHA512 hash of its own contents. So a corrupt chunk is detected immediately and eliminated.
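As an illustration of that self-validating naming scheme (using the sha2 crate here purely as an example dependency, not necessarily what the vault code uses):

```rust
// Example dependency: sha2 = "0.10" in Cargo.toml (illustration only).
use sha2::{Digest, Sha512};

/// A chunk's name is the SHA-512 hash of its contents, so anyone holding the
/// chunk can verify it simply by re-hashing and comparing against the name.
fn chunk_is_intact(expected_name: &[u8], contents: &[u8]) -> bool {
    Sha512::digest(contents).as_slice() == expected_name
}

fn main() {
    let data = b"some encrypted chunk bytes";
    let name = Sha512::digest(data); // the chunk's self-validating name

    assert!(chunk_is_intact(&name, data));

    let mut corrupted = data.to_vec();
    corrupted[0] ^= 0xFF; // flip some bits
    assert!(!chunk_is_intact(&name, &corrupted)); // detected immediately
}
```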
About the chunk size, this is the actual code:

```rust
/// Holds the information that is required to recover the content of the encrypted file. Depending
/// on the file size, such info can be held as a vector of ChunkDetails, or as raw data.
#[derive(RustcEncodable, RustcDecodable, PartialEq, Eq, PartialOrd, Ord, Clone)]
pub enum DataMap {
    /// If the file is large enough (larger than 3072 bytes, 3 * MIN_CHUNK_SIZE), this algorithm
    /// holds the list of the files chunks and corresponding hashes.
    Chunks(Vec<ChunkDetails>),
    /// Very small files (less than 3072 bytes, 3 * MIN_CHUNK_SIZE) are not split into chunks and
    /// are put in here in their entirety.
    Content(Vec<u8>),
    /// empty datamap
    None,
}
```
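Going by those comments, whether a file ends up as Content or Chunks depends only on the 3 * MIN_CHUNK_SIZE threshold. A rough sketch of that decision (my own illustration, assuming MIN_CHUNK_SIZE = 1024 as the comments imply, not code from the library):

```rust
// Illustration only: mirrors the threshold described in the DataMap comments above,
// assuming MIN_CHUNK_SIZE = 1024 bytes (so 3 * MIN_CHUNK_SIZE = 3072 bytes).
const MIN_CHUNK_SIZE: usize = 1024;

fn storage_form(file_size: usize) -> &'static str {
    if file_size > 3 * MIN_CHUNK_SIZE {
        "DataMap::Chunks - self-encrypted into its own chunks"
    } else if file_size > 0 {
        // (behaviour at exactly 3072 bytes is whatever the real implementation decides)
        "DataMap::Content - the whole file sits inside the datamap"
    } else {
        "DataMap::None"
    }
}

fn main() {
    println!("100 KB file: {}", storage_form(100 * 1024));
    println!("2 KB file:   {}", storage_form(2 * 1024));
    println!("empty file:  {}", storage_form(0));
}
```

So by this code a 100 KB file is well above the threshold and still gets its own chunks rather than being packed with someone else’s data; only files under 3072 bytes skip chunking and live inside the datamap.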
People really knew that? @Artiscience asked “does that mean that large files are exponentially more dependent on a stable network?” and I think the answer is yes. Also there’s old discussion about controlling 75% of the network as being some sort of magic cutoff but that’s not quite accurate.
Tangentially related… there was discussion in an old thread about proof of redundancy (e.g., many vaults sharing one disk). I gather it’s not quite mathematically solved, but may be impractical given the ranking system. However, I have this half-baked idea that maybe filling up a vault’s free space with some sort of junk data, and occasionally asking it to hash something against that data, could serve as validation of its claimed unique storage. There would be a lot more to flesh out to actually make it work… Or alternatively, always fill up a vault immediately upon joining the network (i.e., increase redundancy to fill 100% of network storage). Then anyone trying to fake their redundancy would eventually be found out and penalized.
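For what it’s worth, a very rough sketch of how that challenge idea could look; everything in it (the seed-derived junk, the challenge format, the use of SHA-512 via the sha2 crate) is my own guess for illustration, not anything the network actually does:

```rust
// Very rough sketch of the "junk data + challenge" idea; sha2 = "0.10" assumed.
use sha2::{Digest, Sha512};

/// Deterministically derive the i-th 64-byte junk block from a public seed, so a
/// verifier can recompute any block without storing the junk itself.
fn junk_block(seed: &[u8], index: u64) -> Vec<u8> {
    let mut hasher = Sha512::new();
    hasher.update(seed);
    hasher.update(index.to_le_bytes());
    hasher.finalize().to_vec()
}

/// The vault's side of a challenge: prove it still holds blocks [start, end)
/// by hashing them together and returning the digest.
fn prove(stored_junk: &[Vec<u8>], start: usize, end: usize) -> Vec<u8> {
    let mut hasher = Sha512::new();
    for block in &stored_junk[start..end] {
        hasher.update(block);
    }
    hasher.finalize().to_vec()
}

fn main() {
    let seed = b"publicly known seed for this vault";

    // The vault fills (a tiny slice of) its free space with deterministic junk.
    let stored: Vec<Vec<u8>> = (0..1000u64).map(|i| junk_block(seed, i)).collect();

    // A verifier picks a random range, recomputes the expected answer from the
    // seed alone, and compares it with the vault's response.
    let (start, end) = (137usize, 412usize);
    let mut hasher = Sha512::new();
    for i in start as u64..end as u64 {
        hasher.update(junk_block(seed, i));
    }
    let expected = hasher.finalize().to_vec();

    assert_eq!(prove(&stored, start, end), expected); // vault really stores the junk
}
```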