EDIT: Sorry, that read a little too cryptic. Maybe this will help non-assembler programmers:
The XOR space was divided into 65,536 (65K) sections, and I counted how many XOR addresses fell into each section.
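For anyone who wants to try the same kind of count themselves, here is a rough sketch of the idea, not the code actually used for the test. It assumes each address is a 32-byte value and that the top 16 bits pick the section; the sample addresses are just SHA-256 hashes of a counter standing in for real chunk addresses, and the names `section_of` and `count_sections` are made up for illustration.

```python
import hashlib
from collections import Counter

NUM_SECTIONS = 65_536  # 2**16 equal-sized sections of the XOR address space


def section_of(address: bytes) -> int:
    """Return the section index: the top 16 bits of the address (assumption)."""
    return int.from_bytes(address[:2], "big")


def count_sections(addresses) -> Counter:
    """Count how many addresses fall into each section."""
    return Counter(section_of(a) for a in addresses)


# Hypothetical stand-in data: hashes of a counter, NOT real network addresses.
addresses = [hashlib.sha256(str(i).encode()).digest() for i in range(100_000)]
counts = count_sections(addresses)

print("sections hit:", len(counts), "of", NUM_SECTIONS)
print("busiest section holds:", max(counts.values()), "addresses")
```

With real data you would feed in the actual record addresses instead of the stand-in hashes and then compare the per-section counts.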
Nice. Can I suggest, though, changing md5sum to sha256sum or similar? SHA is a more secure algo, so it should distribute even better. What you show is already great; this is just to be more accurate.
I used md5sum to have small files and there is no issue with security.
This is purely a test setup on a local network.
And doesn’t the self-encryption used by “safe files upload” use sha256, so you get that even distribution anyhow? I was using the stock standard client uploader to do this.
I doubt real-life files will be as well distributed as sha256.
EDIT: Oh do you know how long it takes to upload 2.5 million files? And to change over to sha256sum now. Ahhhhhhhhh
I have another batch of 5 million files in the last hours of running. That’d be 7.5 million files total
And in case anyone is wondering: no, I did not keep the logs.
I should explain. Hash has almost zero to do with security in terms of encryption etc. It is used internally in algorithms of course.
The security from SHA is actually the distribution of the outputs. I can explain it like this:
SHA is more secure than MD5 as it has fewer collisions, and to that end it also distributes any random input more evenly across the address range.
It’s the more even distribution that also gives rise to fewer collisions. So you can read the security part of a hash function as its ability to more evenly distribute outputs from random inputs.
Most folk miss that. (Similarly, the output of a more secure symmetric encryption algorithm is more random than that of a weaker algo, e.g. AES-GCM output is pretty random compared with XORing alone.)
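If anyone wants to check the evenness of the two hashes for themselves, here is a small sketch of one way to measure it. It is only an illustration under some assumptions: the inputs are counter strings rather than real files, the first byte of each digest picks one of 256 sections, and a plain chi-square statistic is used as the "evenness" score (lower means closer to perfectly even). The function names are made up.

```python
import hashlib


def section_counts(algo: str, n_inputs: int = 50_000) -> list:
    """Hash counter strings with `algo` and count digests per first-byte section."""
    counts = [0] * 256
    for i in range(n_inputs):
        digest = hashlib.new(algo, str(i).encode()).digest()
        counts[digest[0]] += 1  # first byte selects one of 256 sections
    return counts


def chi_square(counts: list) -> float:
    """Chi-square statistic against a perfectly even spread over all sections."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


for algo in ("md5", "sha256"):
    print(algo, "chi-square:", round(chi_square(section_counts(algo)), 1))
```

Run it with a few different input sets before drawing any conclusion; a single run on one input set only says how that particular set happened to land.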
Impressive work on running the analysis for so many files. It’s interesting to see the distribution results, and I agree with the suggestion of trying SHA256 for better accuracy in real-life scenarios.
I checked that the md5 created all unique files. These files were run through the standard client, which does self-encryption to produce the 3 chunks/records. The self-encryption will produce essentially random chunks/records to upload, and it uses sha256 itself. All the generated chunks were unique as well.
The md5 is simply creating unique files to upload. Real-life files will not be totally random anyhow, with many being text/office/images which have similarities. Thus real life is not totally random files being uploaded.
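For completeness, a uniqueness check like the one mentioned above can be sketched roughly as follows. This is not the script that was used; it assumes the file (or chunk) contents are available as byte strings and uses SHA-256 of the content as the fingerprint. `find_duplicates` is a made-up helper name.

```python
import hashlib


def find_duplicates(blobs) -> list:
    """Return the indices of blobs whose content was already seen earlier."""
    seen = set()
    duplicates = []
    for index, blob in enumerate(blobs):
        fingerprint = hashlib.sha256(blob).digest()
        if fingerprint in seen:
            duplicates.append(index)
        else:
            seen.add(fingerprint)
    return duplicates


# Hypothetical sample data standing in for generated files.
sample = [b"file-1", b"file-2", b"file-1"]
print("duplicate indices:", find_duplicates(sample))
```

An empty list means every item had distinct content.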