Is this deduplication process implemented yet? I ask because if I repeatedly rename and re-upload a largish file, the process takes pretty much exactly the same time to complete on each occasion, whereas I'd expect it to be quicker after the first time, since the network will realise an identical chunk already exists and thus not bother to store it again. Perhaps the upload has to happen in full anyway before the network can make that decision? Or, alternatively, is it the encryption process rather than the uploading process that is the time-consuming bit?
The concept of dedup is to save on storage (it's inherent to SAFE) BUT not to let the user know that the chunk already exists on the network.
Inherent to SAFE
This means that the network stores the chunk according to its contents: the hash of the contents is used as the storage address. So two chunks with the same data will have the same storage address (a minimal sketch of this is below).
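Purely as illustration, and assuming SHA-256 as the hash (the real network has its own hashing and addressing scheme), content addressing boils down to something like this:

```python
import hashlib

def chunk_address(chunk: bytes) -> str:
    """Derive a storage address from the chunk's own contents.
    Identical chunks always map to the same address."""
    return hashlib.sha256(chunk).hexdigest()

# Two uploads of the same data land at the same address,
# so the network only ever needs to keep one copy.
assert chunk_address(b"same data") == chunk_address(b"same data")
assert chunk_address(b"same data") != chunk_address(b"other data")
```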
Not let you know it already exists
This is immutable data we are talking about. You will be charged the "PUT" cost. If done properly, the client still uploads the chunk, but the network has no need to actually store it again, so the client cannot deduce that the chunk already existed. Yes, charging for the chunk store in this case has been confirmed. A rough sketch of that behaviour follows.
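Here is a toy model of that idea; `Vault`, `Client`, `put` and `charge_put` are hypothetical names, not the real vault API. The point is that the PUT is charged either way, a duplicate chunk is never written twice, and the uploader gets the same reply in both cases:

```python
class Client:
    def __init__(self, balance: int):
        self.balance = balance

    def charge_put(self, cost: int = 1) -> None:
        self.balance -= cost


class Vault:
    """Toy vault: charges every PUT but stores a chunk only once;
    the reply is identical either way, so the uploader cannot tell
    whether the chunk already existed."""

    def __init__(self):
        self.store = {}

    def put(self, address: str, chunk: bytes, client: Client) -> str:
        client.charge_put()              # charged whether or not it is a duplicate
        if address not in self.store:    # only the first copy is actually persisted
            self.store[address] = chunk
        return "OK"                      # same response in both cases


vault, alice, bob = Vault(), Client(balance=10), Client(balance=10)
addr = "a" * 64
vault.put(addr, b"chunk", alice)   # first upload: stored and charged
vault.put(addr, b"chunk", bob)     # duplicate: charged but not stored again
assert alice.balance == bob.balance == 9
```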
Now the flaw is that an APP could be written to check, prior to uploading, whether the chunks already exist. Simply run the self-encryption process on your PC and then request the chunk at that address to see if it exists. The problem with this is the time consumed and bandwidth used, so much so that people might prefer to pay the minuscule amount just to upload the file and be done with it.
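Such a "check first" app could look roughly like this sketch, where the network is stood in for by a plain dict and `chunk_address` is the content-hash helper from above (both are assumptions for illustration, not a real client API):

```python
import hashlib

def chunk_address(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

def thrifty_upload(chunks, network: dict) -> int:
    """Hypothetical client-side check: self-encrypt locally, probe each chunk
    address first, and only PUT the chunks that are genuinely missing.
    Saves PUT cost at the price of extra time and bandwidth."""
    puts_paid = 0
    for chunk in chunks:
        addr = chunk_address(chunk)   # content-derived address
        if addr not in network:       # a free GET would answer this question
            network[addr] = chunk     # only genuinely new chunks incur a PUT charge
            puts_paid += 1
    return puts_paid

network = {}
thrifty_upload([b"chunk-1", b"chunk-2"], network)               # pays for 2 PUTs
assert thrifty_upload([b"chunk-1", b"chunk-3"], network) == 1   # only chunk-3 is new
```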
Why hide that a chunk exists, though? Does it really reveal anything? Isn't it just a waste to upload data again if it already exists, instead of just hashing it on the client and only uploading if it doesn't already exist? I think it would be nicer if I didn't have to pay PUT costs for data that already exists on the network.
Each time you reveal something you potentially reduce anonymity/security, even if only by a very small amount.
Let's say you gave a confidential document to just one person. Then that document appears on SAFE, and you find out because de-dup reports that it already exists. I am sure a ton of examples could be given as to how revealing this can lead to an information leak.
Also, the network still uses resources when it attempts to store the chunk, so the PUT charge continues to apply, and for that to work you need to not reveal that the chunk exists. Otherwise people get a little snotty about paying for something they feel they should not have had to pay for.
Also it would be nice not to have to pay for anything too
But dedup becomes a significant source of "income" for the network, since over time more and more data will be duplicates. The network still has to perform work when you try to upload a duplicate file; even the checking costs bandwidth. Also, with "pay once" storage, the cost of serving popular files is offset somewhat by dedup. I think it's reasonable to assume that a lot of popular files will also be uploaded by more than one person.
So if you don't want to pay, then use an APP that spends your time checking whether the chunks exist prior to attempting the upload. Since reads are free, it won't cost you anything (except time/bandwidth) to check. But as I said before, many will most likely just upload the file to save the bother. And yes, it uses SAFE resources to do this, but that's the trade-off of a free-to-read network.
I have to say it makes sense to charge even if the file is already on the network. The idea being that if person A (the original uploader) deletes their file, the file will stay because person B (the second uploader) has paid for it to remain on the network. Extra work has to be done to verify it stays on the network, and that is what B's payment goes towards.
Given that the average Maidsafe host is only going to have modest bandwidth at their disposal, there are going to be use-cases where higher availability is desired. Imagine hundreds of thousands of people trying to access the same video that has just gone viral… Wouldn’t it make sense for there to be (many) multiple copies of that high-demand data available?
In that case there will be a huge number of nodes supplying the chunks that make up the viral video.
Also remember that the network has 6-8 copies of each chunk stored in vaults.
When a file (e.g. a video) becomes popular, the built-in network caching will often be supplying the chunks rather than fetching them from the vaults. The more popular the file, the more caches hold a copy; a rough sketch of the idea is below.
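Purely for illustration, and assuming a least-recently-used policy (the actual caching rules and cache sizes on SAFE may well differ), a relay-node cache could be modelled like this:

```python
from collections import OrderedDict

class RelayNodeCache:
    """Rough sketch of opportunistic caching: a node on the route keeps
    recently relayed chunks, so popular chunks are answered from cache
    instead of going all the way back to the holding vaults."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, address, fetch_from_vaults):
        if address in self.cache:
            self.cache.move_to_end(address)   # popular chunks stay cached
            return self.cache[address]
        chunk = fetch_from_vaults(address)    # cache miss: go back to the vaults
        self.cache[address] = chunk
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least recently used chunk
        return chunk


cache = RelayNodeCache(capacity=2)
cache.get("addr-1", lambda a: b"chunk data")   # miss: fetched from vaults, then cached
cache.get("addr-1", lambda a: b"unused")       # hit: served straight from the cache
```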
How can the system find duplicates if the chunks are encrypted? Duplicates of chunks are possible, of course, but only as a result of random data collisions: processing the same chunk with different passwords would produce different data blocks. As a result, the same file might be split into different encrypted chunks, and, vice versa, identical encrypted chunks might come from different files.
Good block ciphers are designed to produce near-random output, so the collision expectation should be very small if the chunk size is relatively big, so this feature might not be as good as it is in traditional cloud services.
If they are exact duplicates then the hashes are identical, so the network can identify them by their hashes without knowing the content.
The encryption of chunks follows a method that encrypts the data using the data itself; look up self-encryption. This way you only need the data map to decrypt a file, no other keys required, so when sharing a private file you do not need to give any of your keys to the other person; you only need to give them the data map. This is how all files are encrypted when stored as immutable data. Public files can be stored encrypted yet anyone can read the whole file: the data map is shared (made public), so anyone can read the file. But a vault cannot read a chunk, since it is encrypted.
As @Jabba says, the address of a chunk is derived from its hash. Since self-encryption is used, two different people encrypting the same file will end up with exactly the same chunks, and thus the same chunk addresses, and this is how the network knows it is a duplicate.
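To make that convergence property concrete, here is a very rough sketch of a convergent scheme where the key is derived from the data itself. The real self_encryption scheme is more involved (it works across multiple chunks, derives each chunk's key from neighbouring chunks' hashes, and uses proper AES plus obfuscation), so this toy XOR version only illustrates why identical files yield identical chunks and addresses:

```python
import hashlib

def self_encrypt(chunk: bytes) -> tuple[str, bytes]:
    """Toy convergent encryption: the key comes from the content, so two
    people encrypting the same data produce byte-identical ciphertext at
    the same address. Illustration only -- not the real scheme."""
    key = hashlib.sha256(chunk).digest()                  # key derived from the data itself
    keystream = hashlib.sha256(key + b"stream").digest()
    # toy XOR "cipher" for illustration only -- not real encryption
    encrypted = bytes(b ^ keystream[i % len(keystream)] for i, b in enumerate(chunk))
    address = hashlib.sha256(encrypted).hexdigest()       # store by hash of the ciphertext
    return address, encrypted

# Same plaintext from two different uploaders -> same ciphertext and same address,
# which is exactly what lets the network deduplicate without reading the content.
assert self_encrypt(b"shared file contents") == self_encrypt(b"shared file contents")
```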