Storage proceeding

I have some beginner questions with regard to the storage mechanics. I probably got parts of it wrong, so feel free to correct me.

(1) Let's assume I have a 10 MB file that I want to store on the SAFE network.

  1. The file is encrypted and chopped into 10 chunks of 1 MB each.
  2. Each chunk is stored on 4 different vaults, which means my initial file is now distributed over 40 different vaults.
  3. Whenever I send a GET request, the file is recomposed from the 10 chunks, depending on which vaults are online.
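The arithmetic in the list above can be sketched like this (the 1 MB chunk size and the replica count of 4 are assumptions taken from this thread, not confirmed network constants):

```rust
// Assumed constants from the discussion above.
const MAX_CHUNK_SIZE: usize = 1024 * 1024; // 1 MB per chunk
const REPLICAS: usize = 4; // copies of each chunk

// Number of chunks a file is split into (ceiling division).
fn chunk_count(file_size: usize) -> usize {
    (file_size + MAX_CHUNK_SIZE - 1) / MAX_CHUNK_SIZE
}

// Total stored copies across the network for one file.
fn total_copies(file_size: usize) -> usize {
    chunk_count(file_size) * REPLICAS
}

fn main() {
    let ten_mb = 10 * 1024 * 1024;
    // A 10 MB file: 10 chunks, 40 stored copies.
    println!("{} chunks, {} copies", chunk_count(ten_mb), total_copies(ten_mb));
}
```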

My question here is: what happens if the 4 vaults that maintain the exact same chunk all go down? In that case I'd expect it to be impossible to get the file? Or is the chunk pushed to another vault when one vault goes down?

(2) If it's true that the entire file cannot be retrieved while those vaults are down, does that mean that large files are exponentially more dependent on a stable network?

(3) I have a 100KB file. How is it chunked?


It is unlikely all 4 will go offline at the exact same time (meaning, before the chunk is rebuilt back to at least two replicas from one of the remaining copies).

Yes, but there may be a copy in local cache, or as soon as one of the nodes comes back, you will be able to get the missing chunk(s). If you need very high availability, keep a local copy (although some old members keep saying that’s not necessary - see old discussions about local copies and backups).

Don’t remember the 100 KB scenario, I think you get a part of a chunk commingled with other people’s stuff, and pay proportionally. But this is probably wrong. :smile:


Wouldn’t this be a possible attack…

If a single entity controls a fraction X of the storage, then the chance of a single chunk residing only on vaults that entity controls is X^4.

So if someone owns 35% of the storage (X = 0.35), a single chunk has about a 1.5% chance of being stored only on their vaults. If that entity has 80% of the storage, the chance of a chunk being stored only on their vaults rises to about 41%.

That entity can then shut off all their nodes simultaneously.

Consider that the file is split into 1MB chunks. Let N be the number of chunks and assume that losing one chunk destroys the file. Then the chance of losing the file is 1-((1-X^4)^N), assuming the entity is malicious.

Thus in this hypothetical, if you store a 100MB file and one entity controls 35% of the network, the chance of losing the file is about 78%.
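The argument above can be checked numerically. This sketch assumes 4 independent, uniformly placed replicas per chunk, as stated earlier in the post:

```rust
// A chunk is lost only if all 4 replicas land on the attacker's vaults:
// probability x^4, where x is the attacker's fraction of network storage.
fn chunk_loss_probability(x: f64) -> f64 {
    x.powi(4)
}

// The file is lost if any of its n chunks is lost: 1 - (1 - x^4)^n.
fn file_loss_probability(x: f64, n_chunks: i32) -> f64 {
    1.0 - (1.0 - chunk_loss_probability(x)).powi(n_chunks)
}

fn main() {
    println!("{:.4}", chunk_loss_probability(0.35));     // ~0.0150 (1.5%)
    println!("{:.4}", chunk_loss_probability(0.80));     // ~0.4096 (41%)
    println!("{:.2}", file_loss_probability(0.35, 100)); // ~0.78 (100 MB file)
}
```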

It’s been some time since I’ve taken combinatorics and I don’t know details of the implementation, so I might be wrong and this is definitely a simplified analysis.


Actually vaults are not persistent anymore, so as soon as any one vault goes offline a new vault is immediately assigned the data.


I think the calculation is correct, but the inputs are not quite as taken - we usually refer to four copies of each chunk, but only for brevity. I think it is actually something more like 4 to 6 live copies, plus a number of offline copies. So four is really the minimum number of live copies at any one time, and rather than the file being lost if all go down, it remains intact, just unavailable until the missing chunk comes back online.

So the calculation is, I think, correct, but the numbers are unduly pessimistic.

Also, as the network grows, it becomes increasingly improbable that a single entity could control a large enough proportion of it to mount this kind of attack. While the network is small, yes, it would potentially be feasible, but what would be the motive, and who would have the resources to mount the attack? Those are the questions I find hard to answer. It does, though, emphasise the importance of network size and a decent ramp in terms of adoption.


That's what I was asking about. So if I take my vault offline, someone (who? the other vaults?) will notice that the network is now lacking copies of the chunks my vault hosted, and will assign a new vault where each chunk is stored.

Isn't this effectively a PUT request that goes unpaid? And what happens once I sign back into the network with my vault? Will there just be another copy? Or will the network delete the chunk from my computer?

Yes, when you go offline the other vaults in the neighborhood notice and send messages to the folks who need to pick up the slack created by your vacancy… The new “neighboring addresses” then request the chunks they need from the other vaults that have the remaining copies of them.

There are no PUT fees attached to this. Farmers are paid based on GETs, not PUTs.

My understanding is that vaults are now non-persistent, and if you turn yours off you come up empty and with a new address next time. Non-persistent vaults - #6 by dirvine


The data managers.

In a way, yes. It’s not called a PUT though.

The chunk will still be there on your computer, but unless the network is rebuilding from a massive outage it won't acknowledge the "old" chunks in your vault. Your vault software will wipe your vault contents and start from scratch.


Thanks @jreighley and @Seneca for clearing things up. Re: the 100 KB file - so this will be attached to another 900 KB package from someone else, as @janitor suggested?


No, chunks can be smaller than 1 MB. Down to 4 KB IIRC.


Ok, then I guess my question is what happens if a file is smaller than the smallest chunk size.

I believe padding is added to get it to the minimal size.


Actually `MIN_CHUNK_SIZE = 1024`, so:

Very small files (less than 3072 bytes, 3 * MIN_CHUNK_SIZE) are not split into chunks and are stored in their entirety.
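A minimal sketch of that rule, using the MIN_CHUNK_SIZE from the self_encryption docs (the exact constant may have changed since this was written):

```rust
// From the self_encryption documentation quoted above; may be out of date.
const MIN_CHUNK_SIZE: usize = 1024;

// Files under 3 * MIN_CHUNK_SIZE (3072 bytes) are kept whole in the data
// map rather than being split into chunks.
fn stored_whole(file_size: usize) -> bool {
    file_size < 3 * MIN_CHUNK_SIZE
}

fn main() {
    println!("{}", stored_whole(3071));       // true: below the threshold
    println!("{}", stored_whole(100 * 1024)); // false: a 100 KB file is chunked
}
```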


I thought David Irvine said it was combined into the datamap (3 KB metadata) or something like that, if it's smaller than the chunk size. But IDK, it was a long time ago and things change.


While we're on the topic of files, and how data could potentially be lost if all 4 vaults went down, I want to raise one thing that I haven't yet found on this forum: file corruption.

Suppose vault 1 has a corrupted chunk and vault 2 goes down. Another vault takes its place and gets the data from vault 1, which is corrupted. Now you have 2 corrupted copies and 2 clean ones. If a vault with a clean copy goes down and its replacement takes a copy from 1 or 2, the corruption spreads; once 3 copies are corrupted, the 4th will eventually be corrupted too.

I was thinking: what if we used something like btrfs "scrub", which uses the metadata to check each file and ensure its integrity has not been compromised? If corruption is found, it recovers by replacing the corrupted file with a clean, non-corrupted copy.


They have changed, and the chunk sizes with them. I'm not even sure the 4 KB minimum is valid anymore. I seem to remember talk of fixing the chunk size at 1 MB to remove some complexity, but I can't remember where. Maybe it is just referred to as 1 MB for simplicity of discussion, but I am sure things have changed more recently. I guess when testing is done the chunk size (min/max) will be determined and fixed for the initial version of the network.

Won't happen, because the chunk will not validate: vault 1's chunk will be declared invalid and will itself be replaced or removed.

Of course, everyone is aware of this, and it was discussed on several occasions before.
The idea is that once the network gets big, it would take a lot of resources to acquire a big enough share.

If you have a RAID that loses 35% of its disks, what happens? You lose data. So you may need a backup.

The name of a chunk, and hence its position in XOR space, is based on the SHA512 of its own contents. So a corrupt chunk is detected immediately and eliminated.
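A minimal sketch of that self-validation idea. The network names chunks by the SHA-512 of their contents; here std's `DefaultHasher` stands in for SHA-512 purely so the example is self-contained without external crates:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for SHA-512: derive a chunk's name from its contents.
fn chunk_name(content: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

// A holder (or requester) re-derives the name from the stored bytes; any
// bit flip changes the derived name, so corruption is detected immediately.
fn is_valid(claimed_name: u64, content: &[u8]) -> bool {
    chunk_name(content) == claimed_name
}

fn main() {
    let name = chunk_name(b"chunk bytes");
    println!("{}", is_valid(name, b"chunk bytes")); // true: content unchanged
    println!("{}", is_valid(name, b"chunk bytez")); // false: corrupted copy
}
```

With content-addressed names there is nothing extra to scrub against: the name itself is the checksum, so a replacement vault can reject a corrupted source copy before replicating it.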

About the chunk size, this is the actual code:

```rust
/// Holds the information that is required to recover the content of the
/// encrypted file. Depending on the file size, such info can be held as a
/// vector of ChunkDetails, or as raw data.
#[derive(RustcEncodable, RustcDecodable, PartialEq, Eq, PartialOrd, Ord, Clone)]
pub enum DataMap {
    /// If the file is large enough (larger than 3072 bytes, 3 * MIN_CHUNK_SIZE),
    /// this algorithm holds the list of the file's chunks and corresponding hashes.
    Chunks(Vec<ChunkDetails>),
    /// Very small files (less than 3072 bytes, 3 * MIN_CHUNK_SIZE) are not split
    /// into chunks and are put in here in their entirety.
    Content(Vec<u8>),
    /// Empty data map
    None,
}
```

People really knew that? @Artiscience asked "does that mean that large files are exponentially more dependent on a stable network?" and I think the answer is yes. Also, there's an old discussion about controlling 75% of the network as some sort of magic cutoff, but that's not quite accurate.

Tangentially related… there was discussion in an old thread about proof of redundancy (e.g., many vaults and one shared disk). I gather it’s not quite mathematically solved but may be impractical given the ranking system. However, I have this half-baked idea that maybe filling up a vault’s free space with some sort of junk data and occasionally asking them to hash something against that could be validation of their claimed unique storage. There would be a lot more to flesh out there to actually make it work… Or alternatively, always fill up a vault immediately upon joining the network (so increase redundancy to fill 100% of network storage). Then anyone trying to fake their redundancy will be found out eventually and penalized.