Proof Of Storage prototype
This post outlines a bit of a tinker inspired by this post by @JoeSmithJr (plus this clarification).
Background
Knowing how much spare space is available to the network helps match the supply of resources with the demand and minimize disruption during periods of high demand.
Mechanisms intended to prevent lying about spare space are subject to time / memory inversions, where only a small portion of the claimed space is actually consumed and any ‘gaps’ are computed on the fly to pretend the space exists. This process is entirely hidden from the tester, making proof of space an unreliable tool for measuring supply.
I previously considered this too easy to cheat to be useful, but maybe that isn’t the case…?
Prototype algorithm
The prototype presented here aims to provide a reliable measure of spare space. It stores a sequence of proofs that consume disk space.
A test can be periodically applied by asking for the proof at a specific point in the sequence. The user looks up the data and sends it back within a certain amount of time to prove they really are storing that data. Because the user doesn’t know which proof will be requested, they must store all proofs in order to respond in the required time.
If the user is trying to cheat by deleting some proofs, they might be missing the proof requested by the tester. The user needs to recalculate that proof (which requires data from some previous proofs). If any of those previous proofs are also missing, the effect cascades and the calculation may take a long time. The more gaps in the proofs, the stronger the cascade effect.
Since all previous proofs have an equal chance of being required in the calculation, it’s not possible for a cheater to choose which ones are ‘safest’ to delete.
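To make this concrete, here’s a simplified sketch (not the exact scheme from the repo) of how the dependencies for a proof could be chosen. The function name, the DEPENDENCY_COUNT value and the byte encoding are illustrative assumptions; the key property is that the dependency indices are derived deterministically but are uniform over all earlier proofs:

```python
import hashlib

DEPENDENCY_COUNT = 10  # hypothetical constant; the real count is a tunable parameter

def dependencies(seed, index, count=DEPENDENCY_COUNT):
    """Indices of the earlier proofs that proof `index` is built from."""
    deps = []
    for n in range(count):
        digest = hashlib.sha3_256(
            seed + index.to_bytes(8, "big") + n.to_bytes(8, "big")
        ).digest()
        # Map the hash onto a uniform index in [0, index): every earlier proof
        # is equally likely to be a dependency, so none are 'safe' to delete.
        deps.append(int.from_bytes(digest, "big") % index)
    return deps

print(dependencies(b"common seed", 1005))
```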
Implementation
See https://github.com/iancoleman/proof_of_storage. The code aims to be well documented via comments so I won’t duplicate the details of the mechanism here. Happy to answer specific questions of course.
The algorithm fills the drive up using a common seed value (eg the first block in the datachain for this section) and a derivation scheme that relies on sha3_256 hashes and ed25519 signatures (both are critical operations used in the normal operation of vaults so the filling doubles as a cpu test for those aspects of the network).
The algorithm can be ‘stretched’ via several parameters to make each proof take longer to calculate. This means a cheater who needs to recalculate data can be detected, because they would take much longer than a simple lookup. But making each proof too slow means the initial filling could take prohibitively long, so there’s a balance to strike between easy-to-generate and hard-to-cheat.
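As a rough illustration of the filling step, here’s a minimal sketch assuming a simplified derivation scheme. The real code also mixes in ed25519 signing, which is omitted here to keep the example dependency-free, and PROOF_SIZE / STRETCH_ROUNDS are placeholder values rather than the repo’s constants:

```python
import hashlib

PROOF_SIZE = 1024 * 1024   # 1 MB per proof, as in the prototype
STRETCH_ROUNDS = 1000      # hypothetical stretching parameter

def next_proof(seed, index, dependency_data):
    """Derive the proof at `index` from the seed and its dependency proofs."""
    state = hashlib.sha3_256(seed + index.to_bytes(8, "big"))
    for dep in dependency_data:
        state.update(dep)
    digest = state.digest()

    # Stretching: extra rounds of hashing make each proof slower to create,
    # which makes on-the-fly recalculation by a cheater easier to detect.
    for _ in range(STRETCH_ROUNDS):
        digest = hashlib.sha3_256(digest).digest()

    # Expand the final digest into a full 1 MB proof.
    proof = bytearray()
    counter = 0
    while len(proof) < PROOF_SIZE:
        proof += hashlib.sha3_256(digest + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(proof[:PROOF_SIZE])
```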
The Honest User
How does the network (or any tester) test a user for how much spare space they have? First they supply the seed that allows the user to fill their drive up with a sequence of proofs. The user calculates consecutive proofs until they’ve filled up the space on their drive.
Once the proofs are calculated the tester can periodically ask for a random proof to ensure the space is still consumed by those proofs.
For example, the tester may request the proof at index 1004. The user looks up their proof at that point in the sequence and supplies the data to the tester. The supplied data should be the same value the tester has for proof 1004.
Since the data for 1004 could only have been calculated if all previous proofs exist, the tester can assume the user has 1004 proofs (each proof is 1 MB in size, so they have at least 1004 MB of spare space). Since the user didn’t know which proof would be requested, the tester can assume all of the proofs up to 1004 must currently be available on the user’s drive.
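A minimal sketch of this periodic test from the tester’s side, assuming the tester keeps (or can recompute) its own copy of the expected proofs; the RESPONSE_DEADLINE value and the function names are placeholders, not anything from the repo:

```python
import random
import time

RESPONSE_DEADLINE = 5.0  # seconds; generous enough for lookup + network transfer

def run_test(request_proof, expected_proofs):
    """Ask the user for one random proof and check both value and response time."""
    index = random.randrange(len(expected_proofs))
    started = time.monotonic()
    supplied = request_proof(index)          # network call to the user
    elapsed = time.monotonic() - started
    if supplied != expected_proofs[index]:
        return "wrong data"
    if elapsed > RESPONSE_DEADLINE:
        return "too slow - possibly recalculating missing proofs"
    return "ok"
```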
Running the prototype proof of storage code in real life shows an honest user can fill their drive at a rate of approximately:
39 MB per second (using an intel i7-7700 desktop with m.2 ssd; 26s for 1 GB of proofs)
24 MB per second (using an intel i7-4500U laptop with sata ssd; 42s for 1 GB of proofs)
To get an idea of how this compares with plain writes to disk, the same desktop can store 212 MB of random data per second, so filling a disk with this algorithm takes about 5 times longer than writing plain random data.
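To put those rates in perspective, a quick back-of-the-envelope estimate of the initial fill time at the desktop rate (the drive sizes are just example values):

```python
# Rough fill-time estimate at the measured desktop rate of ~39 MB/s.
fill_rate_mb_per_s = 39
for spare_gb in (10, 100, 1000):
    seconds = spare_gb * 1024 / fill_rate_mb_per_s
    print(f"{spare_gb} GB -> about {seconds / 60:.0f} minutes")
# 10 GB   -> about 4 minutes
# 100 GB  -> about 44 minutes
# 1000 GB -> about 438 minutes (~7.3 hours)
```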
If the user takes more than a certain time to respond (lookup + network transfer), they were probably spending that time recalculating a missing proof and can be suspected of cheating.
If the user takes just slightly longer than the maximum allowable response time, were they cheating or did they just have a bad network connection? Ideally this doubt can be removed by making any cheating take far longer to respond than any plausible variation in the time for a lookup + network transfer.
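One way to think about setting that threshold is to keep the deadline well above honest lookup plus worst-case network time, but well below the cost of recalculating even one missing proof. The numbers below are illustrative assumptions, not measurements from the prototype:

```python
# Illustrative timing budget for choosing the response deadline.
honest_lookup = 0.05        # seconds to read 1 MB from disk (assumption)
network_worst_case = 2.0    # generous allowance for a slow connection (assumption)
recompute_one_proof = 10.0  # stretched cost of rebuilding one missing proof (assumption)

deadline = honest_lookup + network_worst_case + 1.0  # extra safety margin
assert deadline < recompute_one_proof  # cheating is still clearly slower
print(f"deadline = {deadline:.1f}s")
```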
The Cheating User
Consider a cheater that stores only even-numbered proofs and discards all odd-numbered proofs. They might claim to have 100 GB but in this case would only be storing 50 GB.
If the cheater is lucky and is requested to supply proof 1004 (which they have not discarded) they send it and appear to be honest.
But if they are requested to supply 1005, the cheater would be missing that proof. They would need to recalculate proof 1005, which depends on the data in some previous proofs, leading to a cascading calculation of any odd-numbered proofs in the dependency chain.
Taking data from the example algorithm:
1005 depends on 711, which would be missing so they must calculate it.
711 depends on 389, which would also be missing, so they must calculate it.
389 depends on 88, which they'd have.
389 depends on 80, which they'd have.
389 depends on 260, which they'd have.
389 depends on 94, which they'd have.
389 depends on 238, which they'd have.
389 depends on 71, which would be missing, so they must calculate it.
71 depends on 52, which they'd have.
71 depends on 28, which they'd have.
71 depends on 58, which they'd have.
etc... until 71 is fully calculated from 10 dependencies
389 depends on 321, which would be missing etc...
389 depends on 359, which would be missing etc...
etc... until 389 is fully calculated from 10 dependencies.
711 depends on 148, which they'd have.
711 depends on 43, which would also be missing, so they must calculate it.
etc... until 711 is fully calculated from 10 dependencies
1005 depends on 382, which they'd have.
1005 depends on 612, which they'd have.
1005 depends on 177, which would be missing so they must calculate it...
etc... until 1005 is fully calculated from 10 dependencies.
After fetching 10 dependencies (and their nested dependencies if any are missing) the proof of 1005 has been calculated and can be returned to the tester.
The nested dependencies mean there’s a lot of work to do to recalculate the missing proof. If each proof takes even a short while to compute, the cheater will take far longer to respond than the simple fast lookup they could have done had they been honest and stored the proofs like they claimed.
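As a sketch of how that cascade could be quantified, the following reuses the hypothetical dependencies() function from the earlier sketch to count how many proofs a cheater would have to recalculate to answer a single challenge (again, this is illustrative, not code from the repo):

```python
def recompute_cost(index, stored, seed, _seen=None):
    """Rough count of proofs that must be recalculated to rebuild proof `index`."""
    if _seen is None:
        _seen = set()
    if index in stored or index in _seen or index == 0:
        return 0  # already on disk (or the seed proof), nothing to recompute
    _seen.add(index)
    cost = 1  # this missing proof itself must be recalculated
    for dep in dependencies(seed, index):
        cost += recompute_cost(dep, stored, seed, _seen)
    return cost

# Example: a cheater who kept only the even-numbered proofs.
stored = set(range(0, 2000, 2))
print(recompute_cost(1005, stored, b"common seed"))
```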
Concepts
If a user is 50% honest the network has a 50% chance of catching them on each test. If there are 10 tests then the chance of catching them is 99.9% (1 - 0.5^10); a quick calculation is sketched below.
The frequency of tests relates to the degree of honesty that should be expected.
The degree of honesty may change through time, possibly becoming extremely dishonest in a very short time.
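A quick calculation of the detection probability, assuming each test picks a proof uniformly at random and the tests are independent:

```python
# Chance of catching a cheater who stores only a fraction of the proofs,
# after `tests` independent random challenges: 1 - fraction_stored ** tests.
def detection_probability(fraction_stored, tests):
    return 1 - fraction_stored ** tests

print(detection_probability(0.5, 10))   # ~0.999, the 50% honest example above
print(detection_probability(0.9, 50))   # ~0.995, even a mostly honest user is caught
```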
Questions
Security
How much stretching should be applied (ie additional hashing / signing operations)?
What ratio of hashing:signing should be used for stretching? It probably depends on typical network load, since there will be more of one than the other.
Does verification of hashes / signatures have a role to play in the algorithm? It’s hard to see how it would, since verification doesn’t create any data and so can’t be used for filling. But it would be nice to include, since a lot of network activity is spent performing verifications.
How much depth should be used for the scavenging? (apologies for using lingo only appearing in the code).
Is it ok to have vaults prove a smaller amount initially and increase the amount they have available over time?
Can / should the stretching factor be dynamic?
Cheating
What degree of time / memory inversion is possible?
How can a cheater create the optimum cheat table? Something about the distribution of dependencies…
How much advantage do they get from this?
Does advantage from cheating compound over time?
Future proofing
How does the introduction of signature / hashing ASICs affect the algorithm?
How does this scale for users with a lot of spare space?
Can testers test vaults larger than themselves?
Broader context
How does this combine with measuring used storage to create a model for economics and incentives?
Is this any better than using sacrificial chunks?
Is consuming space with ‘useless’ data a good idea? (related question about bitcoin mining).
Intuition
My feeling is that network timing will totally dominate the response time, even if cheating happens. I think exploring what happens when the various consts
at the top of the code are changed is important to better appreciate how to make cheating computationally expensive without compromising the time it takes for the initial generation.
What Next
I think the algorithm works well enough to be useful. I want to implement a cheater and see how much tweaking it takes to reliably differentiate between honest and cheating responses.
Any ideas or mistakes would be really interesting to hear about.