Like if I post a drawing or song I made, and someone later posts a compression of it, would the network see that compressed version of my file as theirs?
Or could it tell the file was ripped (literally) from mine?
(I could just see it being annoying if I release a new book, and their .epub version becomes more popular than my .txt version, and they get all the credit...)
As far as I know, the files would look completely different to the network. You could change just 1 bit in a 100GB file and the network would think it's a completely new one.
Also, the network doesn't see files. Each node only sees pieces of files; it has no idea what the whole file looks like or how many pieces it's made of.
Didn't sound disrespectful at all. I just really thought I understood exactly what deduplication was. I thought that all chunks of a file were scrambled with the hash of the file, so changing just one bit would radically change the hash, which would in turn change every chunk once encrypted.
Some more reading to do. I guess you can't figure out seven years of a man's work in two weeks.
It does. The thing is, though, if you change a small thing in what becomes the first chunk, and that changes how the rest of the file is divided (chunked), the chunks all hash out differently. As I recall, the chunks are stored according to their hash, so they'd all change.
If your file change doesn't affect how the file breaks up, the change will only affect that chunk and the chunks whose encryption depends on its hash. So you'd retain most of the dedup but lose some of it.
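To make the difference between the two cases concrete, here is a minimal sketch. It assumes fixed-size chunking and SHA-256, with a tiny 4-byte chunk size, and it only looks at the plaintext chunk hashes (it ignores the encryption step entirely). The function names are made up for illustration and are not taken from the network's code.

```python
import hashlib

CHUNK_SIZE = 4  # tiny size for illustration only (assumption, not the real chunk size)


def chunk_hashes(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks and return each chunk's SHA-256 hex digest."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def count_shared(a: list, b: list) -> int:
    """Count the positions at which two chunk-hash lists agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


original = b"AAAABBBBCCCCDDDD"
flipped = b"AAAABBBBCCCXDDDD"    # one byte changed in place: boundaries stay put
inserted = b"XAAAABBBBCCCCDDDD"  # one byte inserted at the front: boundaries shift

base = chunk_hashes(original)
in_place_shared = count_shared(base, chunk_hashes(flipped))
shifted_shared = count_shared(base, chunk_hashes(inserted))

print(in_place_shared, shifted_shared)  # 3 0
```

With the in-place change, three of the four chunk hashes still match. With the one-byte insertion every chunk boundary moves, so none of them match.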
There are two separate questions in this thread: (1) the de-duplication process, and (2) recognizing files for payment.
Files less than 1KB are currently stored in the DataMap itself (so with the directory metadata), and as such are not de-duplicated. Files larger than 1KB are split into a minimum of 3 chunks, where each chunk has a maximum size of 1MB. The encryption keys for each chunk are the unencrypted hashes of the prior two chunks (with modulus).
If a single bit is flipped in a file larger than 1KB, at minimum 3 chunks change due to the encryption process. So in a 100GB file, if one bit is changed, 3MB of data is definitely uploaded, and the rest may be uploaded (when uploading occurs is a whole other discussion). The user will only need enough SafeCoin for storing 3MB of data.
However, as @fergish mentioned, if bit(s) were removed or added to the first chunk, then it will likely change all of the chunks. This is not guaranteed though, because patterns in data could leave some chunks identical.
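The three-chunk minimum can be sketched as follows. This is a toy model, not the actual self-encryption code: it assumes SHA-256, reads "with modulus" as wrapping around the chunk index, and uses a throwaway XOR keystream as a stand-in for the real cipher. All names here are hypothetical.

```python
import hashlib


def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def toy_encrypt(chunk: bytes, key: bytes) -> bytes:
    """XOR the chunk with a SHA-256-derived keystream. Illustration only, not secure."""
    stream = b""
    counter = 0
    while len(stream) < len(chunk):
        stream += sha(key + counter.to_bytes(4, "big"))
        counter += 1
    return bytes(c ^ s for c, s in zip(chunk, stream))


def stored_names(chunks: list) -> list:
    """Encrypt each chunk with a key built from the plaintext hashes of the two
    preceding chunks (wrapping around), then name it by the hash of its ciphertext."""
    n = len(chunks)
    plain = [sha(c) for c in chunks]
    names = []
    for i, chunk in enumerate(chunks):
        key = plain[(i - 1) % n] + plain[(i - 2) % n]
        names.append(sha(toy_encrypt(chunk, key)).hex())
    return names


chunks = [bytes([i]) * 8 for i in range(6)]  # six small dummy chunks
before = stored_names(chunks)

edited = list(chunks)
edited[2] = b"\x03" + edited[2][1:]  # flip one bit in chunk 2 (0x02 -> 0x03)
after = stored_names(edited)

changed = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]
print(changed)  # [2, 3, 4]
```

Chunk 2 changes because its content changed, and chunks 3 and 4 change because their keys include the hash of chunk 2. At a maximum of 1MB per chunk, that is where the 3MB figure comes from.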
As for payment for accessing files, I'm not sure about those details, or whether they've even been completely finalized. It's likely this will happen at the chunk level because there are no references to files on the network (no i-nodes), which is yet another long discussion. In other words, the algorithm would likely be "anytime these chunks are requested, notify X". I think this is still open for discussion; I can't think of anywhere it's implemented currently.
I would expect any chunk that matches, even if that chunk comes from different files, would be pointed to by multiple tables but would only be stored once. Just a guess, but I "think" that is how true dedup would function.
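A minimal sketch of that guess, assuming chunks are stored under the SHA-256 of their content and each file keeps its own table of chunk names. `ChunkStore` and `store_file` are hypothetical names, and this models only the storage side, not encryption or payment.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: a chunk's name is the hash of its bytes."""

    def __init__(self):
        self._chunks = {}

    def put(self, chunk: bytes) -> str:
        name = hashlib.sha256(chunk).hexdigest()
        # An identical chunk maps to the same name, so it is kept only once.
        self._chunks.setdefault(name, chunk)
        return name

    def get(self, name: str) -> bytes:
        return self._chunks[name]

    def __len__(self) -> int:
        return len(self._chunks)


def store_file(store: ChunkStore, chunks: list) -> list:
    """Store every chunk and return the file's table of chunk names."""
    return [store.put(c) for c in chunks]


store = ChunkStore()
map_a = store_file(store, [b"intro", b"shared", b"end-a"])
map_b = store_file(store, [b"shared", b"end-b"])

stored_count = len(store)
rebuilt_b = b"".join(store.get(name) for name in map_b)

print(stored_count)            # 4: five chunks referenced, four stored
print(map_a[1] == map_b[0])    # True: both tables point at the same chunk
```

Note that the store itself has no idea how many tables point at a given chunk, which is the same reason old chunks are hard to clean up.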
I think you are right, and this would answer my previous question about what happens to old chunks. They would remain there because the vaults won't know whether they are part of anyone else's file or not.
Correct, this was discussed in another post. It's difficult to know whether anyone is still pointing to the chunk while simultaneously storing nothing user-identifiable with the chunk. David has some thoughts on this, but I'm not sure if it's something that will make it by launch.
Also - files on the network are stored with a version history, but that history will have a hard limit. So if a file is deleted, eventually it is possible that nothing is pointing to it.
I don't quite understand versioning. What problem is it meant to solve? And is it a list of chunk IDs stored in the data map that points to all previous chunks needed for a file?
It's mainly for application developers, and it's unlikely end users will see this very often (although this will depend on the application). The advantage is that the developer can say "replace this version of the document". If multiple computers are trying to modify the same document at the same time, only one will "win"; the other(s) will get an error. The previous versions can still be retrieved if the developer needs to inspect differences between versions and try to re-commit the desired changes.
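Roughly, the "replace this version" behaviour could look like the sketch below. It is an in-memory illustration only: the class, the exception, and the history limit of 3 are all assumptions made up for the example, not the network's API.

```python
from collections import deque


class VersionConflict(Exception):
    """Raised when a writer tries to replace a version that is no longer current."""


class VersionedDocument:
    def __init__(self, content: bytes, max_history: int = 3):
        # deque(maxlen=...) silently drops the oldest entry: a hard history limit.
        self._history = deque([content], maxlen=max_history)
        self.version = 0

    def read(self):
        """Return (current version number, current content)."""
        return self.version, self._history[-1]

    def replace(self, expected_version: int, content: bytes) -> int:
        """Commit new content only if the caller saw the current version."""
        if expected_version != self.version:
            raise VersionConflict(
                f"expected version {expected_version}, current is {self.version}")
        self._history.append(content)
        self.version += 1
        return self.version

    def history(self) -> list:
        """Return the retained versions, oldest first."""
        return list(self._history)


doc = VersionedDocument(b"draft 1")
seen_by_a, _ = doc.read()
seen_by_b, _ = doc.read()

doc.replace(seen_by_a, b"draft 2 from A")  # A commits first and wins
try:
    doc.replace(seen_by_b, b"draft 2 from B")  # B is still on version 0
    b_won = True
except VersionConflict:
    b_won = False

print(b_won)          # False
print(doc.history())  # [b'draft 1', b'draft 2 from A']
```

After the error, B can read the current version, compare it with its own changes, and try the commit again.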