Does compressed data count as different?

Like if I post a drawing or song I made, and someone later posts a compression of it, would the network see that compressed version of my file as theirs?

Or could it tell the file was ripped (literally) from mine?

Sorry Mr. @dirvine if you have a sec :slight_smile:

(I could just see it being annoying if I release a new book, and their .epub ver becomes more popular than my .txt ver, and they get all the credit…)

As far as I know, the files would look completely different to the network. You could change just 1 bit in a 100GB file and the network would think it’s a completely new one.

Also, the network doesn’t see files. Each node only sees pieces of a file; it has no idea what the whole file looks like or how many pieces it’s made of.
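To make the compression case concrete, here is a minimal sketch. It assumes data is identified by a hash of its raw bytes; SHA-256 and the `address` helper are stand-ins I picked for illustration, not necessarily what the network uses.

```python
import hashlib
import zlib

# A made-up "book" standing in for the original upload.
original = b"Chapter 1. It was a dark and stormy night. " * 1000
compressed = zlib.compress(original)


def address(data):
    """Hypothetical content address: SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


# The compressed copy carries the same content once decompressed...
same_content = zlib.decompress(compressed) == original
# ...but its bytes, and therefore its hash-based address, are different.
same_address = address(original) == address(compressed)

print(same_content, same_address)
```

Under that assumption, anything that only looks at hashes of bytes has no way to tell that the second upload was derived from the first.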

I definitely thought that was not the case.

I heard deduplication happens on the chunk level


I believe deduplication only works when you and I upload the exact same file. I could be wrong, but that’s how I understand it.

Yeah, there have been some good discussions on it. I’d recommend checkin’ ’em out :smile:

Yeah, I realize I don’t know much about it, maybe I should just go head to bed. I’ll read on it tomorrow.

I just reread my last comment, and I hope I don’t sound disrespectful.

I’m just looking for an unusually specific answer here, mate

but thanks!


Didn’t sound disrespectful at all. I just really thought I understood exactly what deduplication was. I thought that all chunks of a file were scrambled with the hash of the file, so changing just one bit would radically change the hash, and that in turn would change every chunk once encrypted.

Some more reading to do. I guess you can’t figure out 7 years of a man’s work in two weeks :slight_smile:


Yeah I’m definitely not 100% on it all either,

I just remember reading de-duplication happens on the chunk level, so I just assumed data-ownership happens there as well.

It’s the only way I can imagine everything working.

It’s about the whole ‘content producer’ part of the network, which I think is still being worked out anyway.

So maybe there’s no answer yet.

Just hoped to know :slight_smile:

It does. The thing is though, if you change a small thing in what becomes the first chunk, and that changes how the rest of the file is divided (chunked), the chunks all hash out differently. As I recall, the chunks are stored according to their hash, so they’d all change.

If your file change doesn’t affect how the file breaks up, the change will only affect that chunk and the one that it is used to hash. So you’d retain most of the dedup but lose some of it.
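Here is a rough sketch of the difference between overwriting a byte and inserting one. It assumes fixed-size chunks named by the SHA-256 of their plain bytes and ignores the encryption step entirely; `CHUNK`, `make_data`, `chunk_names` and `new_chunks` are hypothetical names for this demo only.

```python
import hashlib

CHUNK = 1024  # demo chunk size (assumption); the thread mentions a 1MB maximum


def make_data(n_bytes):
    """Deterministic, random-looking filler bytes so the demo is reproducible."""
    out = bytearray()
    counter = 0
    while len(out) < n_bytes:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n_bytes])


def chunk_names(data):
    """Split into fixed-size chunks and name each by the hash of its bytes."""
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]


def new_chunks(old, new):
    """Count chunk names in `new` that were not already present in `old`."""
    return len(set(new) - set(old))


data = make_data(8 * CHUNK)
overwritten = bytes([data[0] ^ 1]) + data[1:]  # same length, one bit flipped
inserted = b"\x00" + data                      # one byte added at the front

base = chunk_names(data)
print(new_chunks(base, chunk_names(overwritten)))  # new chunks after an overwrite
print(new_chunks(base, chunk_names(inserted)))     # new chunks after an insertion
```

In this toy setup the overwrite touches a single chunk, while the insertion shifts every chunk boundary, so none of the old chunks can be reused.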

Anyway, that’s my understanding of it.


There are two separate questions in this thread: (1) the de-duplication process, and (2) recognizing files for payment.

Files less than 1KB are currently stored in the DataMap itself (so with the directory metadata), and as such are not de-duplicated. Files larger than 1KB are split into a minimum of 3 chunks, where each chunk has a maximum size of 1MB. The encryption keys for each chunk are the unencrypted hashes of the prior two chunks (with modulus).

If a single bit was flipped on a file larger than 1KB, at minimum 3 chunks change due to the encryption process. So in a 100GB file, if one bit was changed, 3MB of data are definitely uploaded, and the rest may be uploaded (when uploading occurs is a whole other discussion). The user will only need enough SafeCoin for storing 3MB of data.
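A toy sketch of why one flipped bit touches three chunks, taking the description above at face value: each chunk’s key comes from the plain hashes of the two chunks before it, wrapping around at the start. The XOR keystream here is a placeholder and not the network’s real cipher, and `CHUNK`, `keystream`, `toy_encrypt` and `stored_names` are made-up names.

```python
import hashlib

CHUNK = 1024  # demo chunk size (assumption), far smaller than the real 1MB


def split(data):
    """Cut the data into fixed-size chunks."""
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]


def keystream(key, n):
    """Stretch a key into n bytes with SHA-256 in counter mode (toy only)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def toy_encrypt(chunks):
    """XOR each chunk with a keystream keyed by the prior two plain hashes."""
    hashes = [hashlib.sha256(c).digest() for c in chunks]
    n = len(chunks)
    encrypted = []
    for i, chunk in enumerate(chunks):
        key = hashes[(i - 1) % n] + hashes[(i - 2) % n]  # wraps ("with modulus")
        stream = keystream(key, len(chunk))
        encrypted.append(bytes(a ^ b for a, b in zip(chunk, stream)))
    return encrypted


def stored_names(data):
    """Name each encrypted chunk by its hash, as a content-addressed store would."""
    return [hashlib.sha256(e).hexdigest() for e in toy_encrypt(split(data))]


data = b"".join(bytes([i]) * CHUNK for i in range(10))  # 10 distinct chunks
flipped = bytearray(data)
flipped[4 * CHUNK] ^= 1  # flip one bit inside chunk 4

before = stored_names(data)
after = stored_names(bytes(flipped))
changed = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]
print(changed)  # [4, 5, 6]
```

Chunk 4 changes because its content changed, and chunks 5 and 6 change because their keys depend on chunk 4’s hash. At a 1MB maximum chunk size, that is where the 3MB figure comes from.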

However, as @fergish mentioned, if bit(s) were removed or added to the first chunk, then it will likely change all of the chunks. This is not guaranteed though, because patterns in data could leave some chunks identical.

As for payment for accessing files, I’m not sure about those details, or whether they’ve even been completely finalized. It’s likely this will happen at the chunk level because there are no references to files on the network (no i-nodes), which is yet another long discussion. In other words, the algorithm would likely be “anytime these chunks are requested, notify X”. I think this is still open for discussion; I can’t think of anywhere it’s currently implemented.


Thanks for the explanation @vtnerd.

And when a chunk becomes obsolete because the file has changed, is there any mechanism to clean it up? Is this what versioning is about?

I would expect any chunk that matches, even if that chunk comes from different files, would be pointed to by multiple tables but would only be stored once. Just a guess, but I ‘think’ that is how true dedup would function.
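Roughly what I have in mind, as a sketch only: a store keyed by chunk hash, with each file’s table being just a list of those hashes. `ChunkStore` and `upload` are hypothetical names, SHA-256 and the 4-byte chunk size are arbitrary picks, and the encryption step is left out.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: each distinct chunk is kept exactly once."""

    def __init__(self):
        self.chunks = {}

    def put(self, chunk):
        name = hashlib.sha256(chunk).hexdigest()
        # Storing an already-known chunk is a no-op; only the name is returned.
        self.chunks.setdefault(name, chunk)
        return name


def upload(store, data, chunk_size=4):
    """Return a 'table' for the file: the ordered list of its chunk names."""
    return [store.put(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


store = ChunkStore()
table_a = upload(store, b"AAAABBBBCCCC")  # three chunks
table_b = upload(store, b"ZZZZBBBBCCCC")  # shares its last two chunks with A

print(len(table_a) + len(table_b), "references,", len(store.chunks), "stored chunks")
```

Two tables hold six references between them, but only four chunks are stored. Note that the store keeps no record of who points at a chunk, so it cannot tell on its own when a chunk is no longer needed.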


I think you are right, and this would answer my previous question about what is happening to old chunks. They would remain there because the vaults won’t know if they are part of anyone else’s file or not.

Makes sense, thx.

Correct, this was discussed in another post. It’s difficult to know whether anyone is still pointing to the chunk while also storing nothing user-identifiable with the chunk. David has some thoughts on this, but I’m not sure if it’s something that will make it by launch.

Also - files on the network are stored with a version history, but that history will have a hard limit. So if a file is deleted, eventually it is possible that nothing is pointing to it.


I don’t quite understand versioning. What problem is it meant to solve? And is it a list of chunk IDs stored in the data map that points to all previous chunks needed for a file?

Makes sense to me, thank you for that

And thank you too @vtnerd so it looks like there may not be an answer yet. Must have gotten ahead of myself :slight_smile:


It’s mainly for application developers, and it’s unlikely end-users will see this very often (although this will depend on the application). The advantage is that the developer can say “replace this version of the document”. If multiple computers are trying to modify the same document at the same time, only one will “win”; the other(s) will get an error. The previous versions can still be retrieved if the developer needs to inspect differences between versions and try to re-commit the desired changes.
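A minimal sketch of that “replace this version” idea, assuming a simple version counter and a capped history. `VersionedDoc`, `VersionConflict` and `max_history` are invented for illustration and are not the network’s API.

```python
class VersionConflict(Exception):
    """Raised when a writer tries to replace a version that is no longer current."""


class VersionedDoc:
    def __init__(self, initial, max_history=3):
        self.history = [initial]      # oldest first, newest last
        self.version = 0
        self.max_history = max_history

    def replace(self, expected_version, new_content):
        """Commit only if the caller saw the latest version; otherwise fail."""
        if expected_version != self.version:
            raise VersionConflict(
                f"expected version {expected_version}, current is {self.version}")
        self.history.append(new_content)
        self.history = self.history[-self.max_history:]  # hard limit on history
        self.version += 1
        return self.version


doc = VersionedDoc("draft")
seen = doc.version                 # two writers both read version 0

doc.replace(seen, "edit from A")   # the first writer wins
try:
    doc.replace(seen, "edit from B")   # the second writer is now out of date
    outcome = "committed"
except VersionConflict:
    outcome = "conflict"

print(outcome, doc.history)
```

The losing writer can look at `doc.history`, work out what changed, and retry against the new version number.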
