Self-encryption breaks a file into chunks before it is uploaded, and if one of those chunks goes missing the entire file becomes unrecoverable. I’m interested in techniques for recovering from this sort of problem.
Let’s set aside the network’s existing protection of chunks (redundancy and data chains) and consider how users themselves might further reduce the chance of corruption or loss of their files on the network.
The most promising technique I found is Parchive, which creates parity (.par2) files that can be used to recover missing data (it was originally designed for Usenet). From the par2 man page:
If test.mpg is an 800 MB file, then par2 will create a total of 8 PAR2 files.
test.mpg.par2 - This is an index file for verification only
test.mpg.vol00+01.par2 - Recovery file with 1 recovery block
test.mpg.vol01+02.par2 - Recovery file with 2 recovery blocks
test.mpg.vol03+04.par2 - Recovery file with 4 recovery blocks
test.mpg.vol07+08.par2 - Recovery file with 8 recovery blocks
test.mpg.vol15+16.par2 - Recovery file with 16 recovery blocks
test.mpg.vol31+32.par2 - Recovery file with 32 recovery blocks
test.mpg.vol63+37.par2 - Recovery file with 37 recovery blocks

The test.mpg.par2 file is 39 KB in size and the other files vary in size from 443 KB to 15 MB.
These par2 files will enable the recovery of up to 100 errors totalling 40 MB of lost or damaged data from the original test.mpg file.
This seems quite efficient, and the degree of recovery can be tailored (e.g. instead of 40 MB recoverable it could be 80 MB, and instead of 100 errors it could be more). If you only need a little bit of recovery you only need to download a small par2 file. It’s really quite elegant.
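As a rough sketch of the workflow with par2cmdline (the -r flag sets the redundancy as a percentage; 5% here is purely illustrative and would be tuned to taste):

    par2 create -r5 test.mpg      # generate .par2 recovery files at ~5% redundancy
    par2 verify test.mpg.par2     # check the file against the recovery data
    par2 repair test.mpg.par2     # rebuild test.mpg from whatever blocks survive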
It could easily be adopted by users as a second upload beside the original file, since par2 simply generates .par2 files which can be uploaded to the Safe Network like any other file.
There are some other options which are quite interesting.
Shamir’s Secret Sharing Scheme requires any M of N total parts to recreate the file (a minimal sketch follows this list), but it is much less efficient than most parity-style systems.
RAID also uses parity, but it operates at the level of whole disks/block devices rather than individual files (a small parity illustration is included after this list).
Reed-Solomon codes, as used on CDs (and underpinning par2 itself), are very efficient (roughly one parity byte for every three data bytes).
There’s a wealth of information on Error Correction and Forward Error Correction on Wikipedia.
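To make the M-of-N idea concrete, here is a minimal Python sketch of Shamir’s scheme over a prime field (the prime, the toy integer secret and the 3-of-5 split are my own illustrative assumptions; a real tool would encode file chunks rather than a single small number):

    import random

    PRIME = 2**127 - 1  # a Mersenne prime comfortably larger than the toy secret below

    def make_shares(secret, m, n):
        # random polynomial of degree m-1 whose constant term is the secret
        coeffs = [secret] + [random.randrange(PRIME) for _ in range(m - 1)]
        shares = []
        for x in range(1, n + 1):
            y = 0
            for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
                y = (y * x + c) % PRIME
            shares.append((x, y))
        return shares

    def recover(shares):
        # Lagrange interpolation at x = 0 gives back the constant term (the secret)
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num, den = 1, 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = num * -xj % PRIME
                    den = den * (xi - xj) % PRIME
            secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
        return secret

    shares = make_shares(123456789, m=3, n=5)  # 3-of-5 split of a toy secret
    assert recover(shares[:3]) == 123456789    # any 3 of the 5 shares are enough

Each share is as large as the secret it protects, so keeping N shares costs roughly N times the original storage, which is where the inefficiency relative to parity schemes comes from.

For comparison, the parity idea behind RAID is much cheaper: a RAID-5-style parity block is just the XOR of the data blocks, and it is enough to rebuild any one missing block (the block contents below are toy values):

    from functools import reduce

    def xor_blocks(blocks):
        # byte-wise XOR of equal-length blocks
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data = [b"AAAA", b"BBBB", b"CCCC"]                # three equal-sized data blocks
    parity = xor_blocks(data)                         # parity block stored alongside

    rebuilt = xor_blocks([data[0], data[2], parity])  # lose block 1, rebuild it
    assert rebuilt == data[1]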
In summary, I think there’s no need to add compulsory parity checks at any layer of the network, but it’s definitely something of interest as a second layer of security for files that users may opt into if it suits their needs.
Do you have any other thoughts on users adding extra layers of security to their data?