Data on SN is replicated, but what if we could do better? More reliability, less overhead? That’s what Information Dispersal Algorithms (IDAs) are for.
This is well-established work, originally published in 1989 by Michael Rabin in the paper Efficient dispersal of information for security, load balancing, and fault tolerance.
The current early code for this is in dirvine/rabin_ida
From the paper, some key points to appreciate are:
The storage and transmission of data files in distributed systems gives rise to significant security and reliability problems.
An obvious countermeasure against loss of files is to store copies at other network nodes. If we consider, say, a total of five copies to be necessary for the required safety margin, then this entails a fivefold blowup of the total storage in the system [which the IDA algorithm is trying to avoid / solve / improve].
we can create k copies of the file, select k paths connecting node A to node B, and send a copy along each path. This will result in k-fold increase in network load [which the IDA algorithm is trying to avoid / solve / improve].
We propose a method, the Information Dispersal Algorithm (IDA), of reliably dispersing the information in F into n pieces or locations. The file can be reconstructed from any m pieces. The salient point is that each piece is of size |F|/m, where |F| is the size (number of characters) of F. Consequently, the total number of characters is (n/m) ⋅ |F|. Since we can choose n so that n/m ≈ 1, our dispersal method is space efficient. Also, the dispersal and reconstruction are computationally efficient.
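To put rough numbers on the space claim, here is a small sketch comparing the fivefold replication mentioned in the quote above with IDA dispersal. The n = 14, m = 10 pair is purely an illustrative choice giving n/m = 1.4; it is not a value from the paper or the network.

```rust
// Storage for a hypothetical 1 MB file F.
// 5-way replication stores 5 * |F|; IDA with n pieces and threshold m
// stores n * (|F| / m) = (n / m) * |F|.
// ASSUMPTION: n = 14, m = 10 are illustrative values, not from the paper.
fn main() {
    let file_len: u64 = 1_000_000; // |F| in bytes
    let (n, m): (u64, u64) = (14, 10);

    let replication = 5 * file_len;
    let piece_len = file_len / m;  // each IDA piece is |F| / m
    let dispersal = n * piece_len; // total across all n pieces

    println!("5x replication  : {} bytes", replication);
    println!("IDA, n=14, m=10 : {} bytes ({:.1}x the file)", dispersal, n as f64 / m as f64);
}
```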
Some things that are worth understanding a bit better, which I’ll slowly work through (or if anyone has insights, please chip in!):
Does IDA replace existing parts of the network logic or is it improving and extending it? (seems like it doesn’t replace anything, just improves it, see Update May 26)
How is this algorithm different from Shamir secret sharing?
How is this algorithm different from erasure coding such as Reed-Solomon codes?
Is this algorithm used by clients, or is it used by nodes? (seems like both, see Update May 26)
Is this used before self encryption, after self encryption, or instead of self encryption?
How do we choose m and n? What are the tradeoffs? (seems to be related to elder size and thresholds, see Update May 26; some illustrative numbers in the sketch after this list)
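On that last question, the formula from the paper at least frames the tradeoff: with n pieces and threshold m, up to n − m pieces can be lost before the file becomes unrecoverable, while storage grows by n/m. The parameter pairs below are purely illustrative, not values the network actually uses.

```rust
// Illustrative (n, m) choices only, not network values.
// With n pieces and threshold m: up to n - m pieces can be lost,
// and total storage is n / m times the original file.
fn main() {
    for (n, m) in [(3u32, 2u32), (7, 5), (14, 10), (20, 10)] {
        println!(
            "n = {:2}, m = {:2} -> tolerates {:2} lost pieces, stores {:.2}x the file",
            n, m, n - m, n as f64 / m as f64
        );
    }
}
```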
Some existing information from maidsafe about IDA:
we’re not proposing replacing replication with an IDA
The change is that adults will be able to split chunk A into seven pieces using an IDA and pass just one piece back to the elders on request, rather than the whole chunk, meaning that much less data is exchanged.
IDA is a different beast from threshold cryptography, designed for optimising data storage and transfer rather than secret sharing.
Here’s a quick and easy paper on it https://www.eecs.harvard.edu/~michaelm/TALKS/RabinIDA.pdf
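To make the seven-piece example quoted above concrete: the quote only gives n = 7, so the reconstruction threshold m = 5 below is an assumption for illustration. Since each IDA piece is |chunk|/m, answering a request with one piece would move roughly a fifth of the chunk rather than the whole thing.

```rust
// Rough sketch of the transfer saving described in the quote above.
// From the quote: a chunk is split into n = 7 pieces.
// ASSUMPTION: reconstruction threshold m = 5 (not stated in the quote).
fn main() {
    let chunk_bytes: u64 = 1024 * 1024; // hypothetical 1 MiB chunk
    let n: u64 = 7;                     // pieces per chunk (from the quote)
    let m: u64 = 5;                     // assumed threshold
    let piece_bytes = chunk_bytes / m;  // each IDA piece is |chunk| / m

    println!("whole chunk per request : {} bytes", chunk_bytes);
    println!("one IDA piece instead   : {} bytes", piece_bytes);
    println!("stored across {} pieces  : {} bytes", n, n * piece_bytes);
}
```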
IDA was also mentioned back in 2014 in this post
IDA (Rabin - Information Dispersal) -> splits up data into N of P (N == number of shares, P == parts required to reconstruct the data). It’s relatively efficient and each part’s size of data/P is pretty close to optimal (similar to how RAID works). It is not secret though.
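For intuition about the “any P of N” property (and why it is not secret), here is a toy sketch of the simplest non-trivial case, m = 2 of n = 3, using plain XOR parity. This is not Rabin’s general construction (which uses an n × m matrix over a finite field), and it is not the dirvine/rabin_ida code; it only shows the shape of the idea: each piece is half the file, total stored is 1.5x the file, and any two pieces rebuild it.

```rust
// Toy "any 2 of 3" dispersal using XOR parity.
// Pieces: A = first half, B = second half, P = A xor B.
// Each piece is |F|/2 bytes; any two pieces reconstruct F.
fn split_2_of_3(file: &[u8]) -> [Vec<u8>; 3] {
    assert!(file.len() % 2 == 0, "toy example assumes even length (pad in practice)");
    let half = file.len() / 2;
    let a = file[..half].to_vec();
    let b = file[half..].to_vec();
    // Parity piece: byte-wise XOR of the two halves.
    let p: Vec<u8> = a.iter().zip(&b).map(|(x, y)| x ^ y).collect();
    [a, b, p]
}

/// Rebuild the file from any two pieces; `missing` is the index (0, 1 or 2)
/// of the piece we do not have.
fn reconstruct_2_of_3(pieces: &[Vec<u8>; 3], missing: usize) -> Vec<u8> {
    let xor = |x: &[u8], y: &[u8]| x.iter().zip(y).map(|(a, b)| a ^ b).collect::<Vec<u8>>();
    let (a, b) = match missing {
        0 => (xor(&pieces[1], &pieces[2]), pieces[1].clone()), // A = B xor P
        1 => (pieces[0].clone(), xor(&pieces[0], &pieces[2])), // B = A xor P
        _ => (pieces[0].clone(), pieces[1].clone()),           // P not needed
    };
    [a, b].concat()
}

fn main() {
    let file = b"hello IDA!".to_vec(); // 10 bytes, even length
    let pieces = split_2_of_3(&file);
    for lost in 0..3 {
        assert_eq!(reconstruct_2_of_3(&pieces, lost), file);
    }
    println!("reconstructed from any 2 of 3 pieces");
}
```

Note that pieces 0 and 1 here are literally the two halves of the plaintext, which is the “not secret” point above: unlike Shamir sharing, an IDA on its own makes no confidentiality guarantee.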