Just finished a small analysis of the distribution of chunk xor hash/addresses

What was done

  • ran 2.5 million files through "safe files upload" on a local network
  • generated the file contents by piping an incrementing number into md5sum
  • every file was unique, and together they produced 7.5 million chunks, all unique (checked the hash/xor addresses using sort unique)
  • scanned the file, counting hash/xor addresses by the first 4 nibbles of the xor address, which gives 65,536 (65K) blocks
  • loaded the counts into LibreOffice Calc and made a chart
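
For anyone wanting to repeat the counting step, here is a minimal sketch in Python. It assumes the xor addresses are already available as hex strings; the function names and the sample addresses are made up for illustration, and this is not the script that was actually used.

```python
from collections import Counter


def bucket_counts(addresses, nibbles=4):
    """Count addresses per bucket, keyed by their first `nibbles` hex digits."""
    counts = Counter()
    for addr in addresses:
        counts[addr[:nibbles].lower()] += 1
    return counts


def full_histogram(counts, nibbles=4):
    """Expand to all 16**nibbles buckets so empty buckets show up as zero."""
    return [counts.get(format(i, f"0{nibbles}x"), 0) for i in range(16 ** nibbles)]


if __name__ == "__main__":
    # Hypothetical addresses, shortened for readability.
    sample = ["0000abcd", "0000ffff", "ffff0123", "00a1beef"]
    hist = full_histogram(bucket_counts(sample))
    print(len(hist), hist[0x0000], hist[0xFFFF], hist[0x00A1])  # 65536 2 1 1
```

The resulting list has one count per block, in address order, so it can be written out one value per line and charted in a spreadsheet.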

Conclusions

  • the distribution is acceptable: it looks like what a reasonably random, evenly distributing self encryption function should produce for random files.

EDIT: Sorry, that read a little too cryptic. Maybe this will help non-assembler programmers:
the xor space was divided into 65,536 (65K) sections, and I counted how many xor addresses fell into each section.
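
As a rough yardstick for what "even" means here, this sketch works out the count per section you would expect if the 7.5 million addresses were perfectly uniform. The uniform (binomial) model is an assumption for comparison, not something measured in the test.

```python
import math


def expected_bucket_stats(n_items, n_buckets):
    """Mean and standard deviation of one bucket's count under a uniform model."""
    p = 1.0 / n_buckets
    mean = n_items * p
    std = math.sqrt(n_items * p * (1.0 - p))
    return mean, std


mean, std = expected_bucket_stats(7_500_000, 65_536)
print(f"{mean:.1f} +/- {std:.1f}")  # 114.4 +/- 10.7
```

So under that assumption most sections should hold somewhere around 114 addresses, give or take about 11.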

16 Likes

Nice. Can I suggest, though, changing md5sum to sha256sum or similar? SHA is a more secure algo, so it should distribute even better. What you show is already great; this is just to be more accurate.

8 Likes

I used md5sum to have small files and there is no issue with security.

This is purely a test setup on a local network.
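
A minimal sketch of how such small unique file contents could be produced, written in Python rather than the shell. It assumes the generator was equivalent to `echo N | md5sum`; the exact command was not posted, and `md5sum_line` is a made-up helper name.

```python
import hashlib


def md5sum_line(n):
    """Mimic `echo n | md5sum`: hash the decimal number plus a newline and
    return the line md5sum prints for stdin ("<32 hex digits>  -")."""
    digest = hashlib.md5(f"{n}\n".encode("ascii")).hexdigest()
    return f"{digest}  -\n"


# Build the contents in memory only; nothing is written to disk here.
contents = [md5sum_line(i) for i in range(1, 1001)]
print(len(contents[0]), len(set(contents)))  # 36 1000
```

Each file is only 36 bytes, which is the "small files" point, and every input number gives different contents.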

And doesn’t the self-encryption used by “safe files upload” use sha256, so you get that even distribution anyhow? I was using the stock standard client uploader to do this.

I doubt real life files will be as well distributed as sha256

EDIT: Oh do you know how long it takes to upload 2.5 million files? And to change over to sha256sum now. Ahhhhhhhhh
I have another batch of 5 million files in the last hours of running. That’d be 7.5 million files total

And in case anyone is wondering, no, I did not keep the logs :rofl:

7 Likes

I should explain. A hash has almost nothing to do with security in the encryption sense, although it is of course used internally in algorithms.

The security from SHA is actually the distribution of the outputs. I can explain it like this

SHA is more secure than MD5 as it has fewer collisions, and to that end it also distributes any random input more evenly across the address range.

It’s the more even distribution that also gives rise to fewer collisions. So you can read the security part of a hash function as its ability to more evenly distribute outputs from random inputs.

Most folk miss that. (It’s similar to how the output of a more secure symmetric encryption algorithm is more random than that of a weaker algo, i.e. AES-GCM output is pretty random compared with xoring alone etc.)
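
One way to put the distribution claim to an empirical test is to bucket the outputs of both hashes and compare a chi-square statistic. This is only a sketch under stated assumptions: sequential numbers as inputs, the first digest byte as the bucket (256 buckets), and a made-up function name. It says nothing about collision resistance.

```python
import hashlib


def chi_square_first_byte(hash_name, n_inputs):
    """Chi-square statistic of first-digest-byte counts against a uniform spread."""
    counts = [0] * 256
    for i in range(n_inputs):
        digest = hashlib.new(hash_name, str(i).encode("ascii")).digest()
        counts[digest[0]] += 1
    expected = n_inputs / 256
    return sum((c - expected) ** 2 / expected for c in counts)


for name in ("md5", "sha256"):
    print(name, round(chi_square_first_byte(name, 100_000), 1))
```

For a uniform spread over 256 buckets the statistic should land near 255, give or take roughly 23, so a value far above that would point to uneven output.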

5 Likes

Just out of interest: What is the hashing algorithm that is run against an encrypted file in the client?

4 Likes

atm it’s SHA256 (sha2)

5 Likes

Impressive work on running the analysis for so many files. It’s interesting to see the distribution results, and I agree with the suggestion of trying SHA256 for better accuracy in real-life scenarios.

I do not see that it makes any difference.

I checked that the md5 created all unique files. These files were run through the standard client which does self encryption to produce the 3 chunks/records. The self encryption will produce essentially random chunks/records to upload and it uses sha256 in itself. All the generated chunks were unique as well.

The md5 is simply creating unique files to upload. Real-life files will not be totally random anyhow, since many are text/office/image files that share similarities.

5 Likes