Data Deduplication

An associate browsed the MaidSafe.net features page and saw deduplication as a security concern. I think he is working under some misconceptions/assumptions about how the system works. Here’s the specific concern…

The system having the ability to identify identical data segments negates the inherent security of those data segments. The statement makes it seem as if no one knows where the segments are when, in fact, they are closely tracked and managed by the system. Any file you store that another person also stores will be identified as such, allowing authorities to identify users who store the same file simply by monitoring the deduplication process.

My guess would be that there is no “deduplication process” to monitor, because deduplication is a simple consequence of how chunks are routed to locations in XOR space based on their hashes, which are calculated before they even enter the network. Also, I believe there is no way for authorities to monitor a user’s activity, unless the user’s computer is already compromised, since each time a user logs on they are assigned a different XOR address.
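To make that concrete, here is a toy sketch of content addressing (my own illustration, not MaidSafe code; the real network uses a wide cryptographic hash such as SHA3, and std’s DefaultHasher only stands in so the example runs with no dependencies):

```rust
// Toy content addressing: the real network uses a wide cryptographic
// hash (e.g. SHA3); DefaultHasher stands in so this runs with no crates.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A chunk's "XOR address" is just a hash of its (already encrypted)
/// content, computed on the client before the chunk is uploaded.
fn chunk_address(encrypted_chunk: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    encrypted_chunk.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let alice_chunk = b"identical encrypted bytes";
    let bob_chunk = b"identical encrypted bytes";

    // Identical content -> identical address: both uploads are routed to
    // the same place in XOR space, so deduplication is a side effect of
    // addressing, not a process anyone runs or can monitor.
    assert_eq!(chunk_address(alice_chunk), chunk_address(bob_chunk));
    println!("both chunks live at {:x}", chunk_address(alice_chunk));
}
```

Because the address is a pure function of the chunk’s content, two people uploading the same chunk independently end up at the same address; there is no registry doing the matching.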

Are my statements correct, and do they fully address his concern?

6 Likes

You are correct; he’s making assumptions about how deduplication is achieved that are not valid.

Files are ‘chunked’ and then encrypted before they leave your machine, and only you have a map that can be used to relate those chunks together.

Once on the network, those chunks end up at what look like random locations. Only someone who holds the data map for a particular file can know which chunks correspond to that file, and nobody can tie a chunk on the network to a user (or file) without first having that same data map.
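For anyone who wants to picture what a data map is, here is a rough sketch of the idea (the field names are my own guesses, not the actual self_encryption structs):

```rust
// Rough sketch of what a data map conceptually holds; field names are my
// own guesses, not the real self_encryption structs.
#[derive(Debug)]
struct ChunkDetail {
    index: usize,       // position of the chunk within the original file
    pre_hash: Vec<u8>,  // hash of the plaintext chunk (feeds key derivation)
    post_hash: Vec<u8>, // hash of the encrypted chunk = its network address
}

#[derive(Debug)]
struct DataMap {
    // Only the file's owner holds this; to the network the chunks are
    // just unrelated encrypted blobs at unrelated addresses.
    chunks: Vec<ChunkDetail>,
}

fn main() {
    let map = DataMap { chunks: Vec::new() };
    println!("{:?}", map);
}
```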

12 Likes

Saying what @happybeing said differently

You store a file

  • it is split and self-encrypted into chunks
  • each chunk is sent to the network to be stored
  • for each chunk you pay the network to do the store
  • you have the datamap (the list of chunk addresses obtained from the self-encryption process, IIRC)
  • the network does the storing
  • if the chunk already exists (duplicate chunk) then no store happens
  • you are returned a stored_OK type of response. There is no indication that it already existed, and no refund.

In other words, there is no response back to you saying whether the chunk already existed, and your chunk store appears exactly the same to you whether or not the chunk was previously on the network.
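A toy vault-side sketch of that indistinguishability (my own illustration, not safe_vault code, and assuming the “charged either way, no refund” behaviour described above for immutable chunks):

```rust
use std::collections::HashMap;

// Toy vault: the response is identical whether or not the chunk was
// already stored, so the uploader learns nothing from deduplication.
struct Vault {
    chunks: HashMap<u64, Vec<u8>>, // address -> encrypted chunk
}

#[derive(Debug, PartialEq)]
enum PutResponse {
    StoredOk,
}

impl Vault {
    fn put(&mut self, address: u64, chunk: Vec<u8>, balance: &mut u64) -> PutResponse {
        *balance -= 1; // charged either way (no refund for immutable chunks)
        // Store only if new, but never tell the caller which case it was.
        self.chunks.entry(address).or_insert(chunk);
        PutResponse::StoredOk
    }
}

fn main() {
    let mut vault = Vault { chunks: HashMap::new() };
    let mut balance = 10u64;
    let first = vault.put(42, b"chunk".to_vec(), &mut balance);
    let second = vault.put(42, b"chunk".to_vec(), &mut balance); // duplicate
    assert_eq!(first, second); // indistinguishable responses
    println!("balance after two PUTs of the same chunk: {}", balance); // 8
}
```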

EDIT: for the no refunds being confirmed see Data Deduplication - #15 by frabrunelle

14 Likes

Thanks @neo, a good, clear, more detailed description.

Are you sure about charging when already stored? My understanding was that MaidSafe still plan to charge, or at least that this is not yet decided.

That was declared by David: yes, it will.

Some of the reasons being that, if you don’t charge, then

  • it tells you that the file (chunks) already exists, which reduces security.
  • “refunding” the amount makes the network more complex.
    • something to do with having to pass the refund back up the line.
    • the network has already done the work of processing the chunk, except for the very last step of actually storing it.
  • also, by charging each time, it doesn’t favor one person over another. Everyone is treated the same in storage costs.
  • it improves the network economics.

Of course this doesn’t consider a technique completely separate from de-duplication: write an APP that does the self-encryption without trying to store the file, and then requests the chunks. This tells you whether the chunks exist without needing to attempt to store them.

3 Likes

The master branch of safe_vault already implements the refund when PUT data already exists in the network.

Edit : As indicated below by @frabrunelle, this is only true for SD and MD and not for immutable data.

3 Likes

Hmm, so isn’t it a DoS attack vector to keep putting the same data? It won’t take up more network space, but it would take processing time (and bandwidth?). And you could do it for free.

6 Likes

I didn’t think it was up to the vault component to issue the refund back to the “PUT” balance.

Yes, it is. Account balances are managed by the MaidManager persona of vaults, including a refund when a PUT is unsuccessful because the data already exists.

Edit : As indicated below by @frabrunelle, this is only true for SD and MD and not for immutable data.

2 Likes

Perhaps a partial refund would be better.

Too complicated - would inevitably involve “magic numbers” that @dirvine dislikes.

But I’m willing to be convinced otherwise?

I suspect we can estimate the cost of checking for existing data vs saving the data, then run with it. The main point is that it should cost something, no matter how small. People generally don’t like spending money to mount a DoS.

3 Likes

Totally agree.
I tend toward the view that if you want to PUT something, it should cost, whether or not it actually results in a new PUT.
“Profit” from this goes to general funds, split between farmers, PtP, whatever.

Anything, just anything that results in fewer pictures of kittens on teh intrawebs…

But absolutely no freebies for something that could get used as a DoS vector.

4 Likes

Thanks for that.

I agree with this. If you commit to something then it’s no loss to you that the system has already stored it for someone else.

That too.

But still (if I wasn’t in a hurry) I would use an APP that checks whether the chunks already exist. It would do something like take my 20 GByte file, self-encrypt the first 10 chunks, and then try to retrieve them from the SAFE network. If they don’t exist then it’s unlikely the rest do, so upload the file in the normal manner. If the first 10 exist then there’s a good chance the rest do, so it encrypts more and checks, say, the last 10, and so on. Thus I don’t even ask the network to store the data, I cannot be asked to pay, and the network hasn’t done 20 GB worth of attempted stores.
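Something like the sketch below, where self_encrypt_first_chunks and network_get are placeholders for whatever client API ends up existing; only the decision logic is the point:

```rust
// Sketch of the probe idea: self_encrypt_first_chunks and network_get are
// placeholders, not a real client API.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn self_encrypt_first_chunks(file: &[u8], n: usize) -> Vec<u64> {
    // Stand-in: pretend each 1 MiB slice self-encrypts to a chunk whose
    // address is a hash of its content.
    file.chunks(1024 * 1024)
        .take(n)
        .map(|chunk| {
            let mut hasher = DefaultHasher::new();
            chunk.hash(&mut hasher);
            hasher.finish()
        })
        .collect()
}

fn network_get(_address: u64) -> bool {
    // Placeholder: a real app would issue a GET and report whether the
    // chunk came back. No store is ever attempted, so nothing is charged.
    false
}

fn main() {
    let file = vec![0u8; 20 * 1024 * 1024]; // stand-in for the 20 GB file
    let probe = self_encrypt_first_chunks(&file, 10);
    if probe.iter().all(|&addr| network_get(addr)) {
        println!("first 10 chunks already exist - probe the last 10 before uploading");
    } else {
        println!("chunks not found - upload the file in the normal manner");
    }
}
```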

3 Likes

Yes, but that’s only for Structured Data and Appendable Data.

If a chunk of ImmutableData already exists, the Vaults send a PUT success (no refund):

https://github.com/maidsafe/safe_vault/blob/310c363ea379cb4d608d065d9007fad52ed97efa/src/personas/data_manager.rs#L482-L488

ImmutableData offers no refund because those chunks aren’t owner-specific, so the user simply gets a PUT success. This isn’t the case for SD/AD, hence the refunds.
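My reading of that behaviour, as a toy branch (not the actual safe_vault code at the link above):

```rust
// Toy summary of the behaviour described in this thread, not the real
// safe_vault logic.
enum Data {
    Immutable,  // content-addressed, not owner-specific
    Structured, // owner-specific (SD)
    Appendable, // owner-specific (AD)
}

fn on_duplicate_put(data: Data) -> &'static str {
    match data {
        // Chunk already present and nobody owns it specifically:
        // report success and keep the charge.
        Data::Immutable => "PUT success, no refund",
        // Owner-specific data already exists: the PUT is unsuccessful,
        // so the charge is refunded.
        Data::Structured | Data::Appendable => "PUT fails, refund issued",
    }
}

fn main() {
    println!("ImmutableData: {}", on_duplicate_put(Data::Immutable));
    println!("StructuredData: {}", on_duplicate_put(Data::Structured));
    println!("AppendableData: {}", on_duplicate_put(Data::Appendable));
}
```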

5 Likes

High five to @neo - right again man! :slight_smile:

1 Like

Actually I was repeating what David said so it should be high five to @dirvine :wink:

And thanks to @frabrunelle for tracking it down

4 Likes

Too modest for an Aussie! :wink:

David’s qualities are contagious :slight_smile:

3 Likes

I’m trying to learn about the SAFE Network and I’m stuck on deduplication (which is pretty early in the FAQ). I understand that the pre-encryption hash allows a client to detect whether a duplicate chunk is already stored on the network, but isn’t the stored chunk encrypted by a client-specific key? If so, then client B could not decrypt the encrypted chunk stored by client A. Also, does deduplication create a problem when a client wants to delete a file/chunk? Is there a use count for each chunk? I’m a programmer, so feel free to get technical or point me to source code files. Thanks!

No, the self-encryption algorithm uses the data itself to form the encryption keys. This allows anyone who has the datamap of the chunks to decode the file. Each chunk on its own is “impossible” to decrypt, but once you have all the chunks in sequence (the datamap) then it’s possible to decrypt them.
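A very rough sketch of why there is no client-specific key (DefaultHasher and XOR stand in for the real SHA/AES purely so it runs with no dependencies; the real algorithm derives each chunk’s key from hashes of neighbouring chunks):

```rust
// Very rough sketch: DefaultHasher and XOR stand in for the real SHA/AES.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn content_key(neighbour_chunk: &[u8]) -> u8 {
    // The key material comes from the file's own content, not from any
    // per-user secret.
    let mut hasher = DefaultHasher::new();
    neighbour_chunk.hash(&mut hasher);
    (hasher.finish() & 0xff) as u8
}

fn toy_encrypt(chunk: &[u8], key: u8) -> Vec<u8> {
    chunk.iter().map(|&byte| byte ^ key).collect()
}

fn main() {
    let chunk = b"the same plaintext chunk";
    let neighbour = b"the previous chunk of the same file";

    // Two clients working independently on the same file derive the same
    // key from the same content, so they produce identical ciphertext
    // (hence deduplication), and either can decrypt using the data map.
    let client_a = toy_encrypt(chunk, content_key(neighbour));
    let client_b = toy_encrypt(chunk, content_key(neighbour));
    assert_eq!(client_a, client_b);
    println!("identical encrypted chunks: {}", client_a == client_b);
}
```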

Sorry, I don’t have the link to the self-encryption explanation with me, but I’m sure a search on “self encryption” in the forum will bring up a wealth of information.

EDIT: follow this link; the post has a video in it that explains it (hopefully, since I didn’t watch the vid to check)

5 Likes