We’ve long accepted a standardised chunk size of 1MB, so perhaps this should be reviewed before it is set in stone?
Firstly, it is unfortunate that chunks have to be a standard size, but that is necessary for deduplication. Any other reasons?
Why does size matter? Well, for various reasons I’ll leave out of this post for now, but there will be a Goldilocks zone, ‘not too big and not too small’.
However, factors affecting that zone will change over time, and have almost certainly changed since David and developers who have since left MaidSafe settled on the 1MB figure years ago.
So I pose two questions for this topic:
Is 1MB still optimal?
Are there ways that future adjustments can mitigate movement of the Goldilocks zone (not necessarily changing chunk size because that will be problematic)?
On the second point, I wonder if a future upgrade could deal in groups of chunks, although as soon as I start to consider how they would be addressed and located my head explodes.
MAX chunk size and storage size are both related magic numbers.
Changing chunk size would be a difficult task in a live network, as we cannot tell if existing chunks are from version 1 or not, so this may be a time to look at versions. However, the only person that can reupload a chunk is the data map holder, who may be offline for a long period or forever.
Chunks are unlinkable and information-theoretically secured, which is very secure but leads to other issues if we try to group them etc.
To clarify, chunk size affects deduplication in that if you change it, data is effectively going to be duplicated if uploaded after a change in chunk size. And since that also involves reuploading and possibly repayment, chunk size becomes a difficult thing to change once it has been set. Hence my wondering if chunks could be grouped together instead, but I don’t see how that could work.
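To make the deduplication point concrete, here is a minimal sketch. It assumes chunks are addressed purely by the hash of their content and uses plain SHA-256 over fixed-size slices; real self encryption also encrypts each chunk, so `chunk_addresses` is a hypothetical stand-in, not the network's actual scheme.

```python
import hashlib
import random


def chunk_addresses(data: bytes, chunk_size: int) -> list:
    """Split data into fixed-size slices and return the SHA-256 hex digest of each.

    Hypothetical stand-in for content addressing; no encryption is applied.
    """
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


# Deterministic pseudo-random "file" so the example is repeatable.
rng = random.Random(42)
file_data = bytes(rng.getrandbits(8) for _ in range(16_384))

old_upload = chunk_addresses(file_data, 1024)      # original chunk size
same_again = chunk_addresses(file_data, 1024)      # second upload, same size
after_change = chunk_addresses(file_data, 2048)    # upload after a size change

# Same file, same chunk size: identical addresses, so nothing new is stored.
print(old_upload == same_again)

# Same file, different chunk size: no shared addresses, so every chunk is new.
print(set(old_upload).isdisjoint(after_change))
```

Both lines print `True`: the second upload deduplicates completely, while the upload after the size change shares nothing with the first and would be stored (and paid for) again.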
So there’s a big question over whether it is ever likely to be feasible to change after launch, even if 1MB ceases to be a good choice as technology changes.
I think in earlier discussions one of the reasons for choosing 1MB was due to it suiting the transport layer. Perhaps fitting within a single packet? If so will use of QUIC and any other changes since it was set affect the reasoning?
Thinking a bit more about ways to change chunk size later…
I guess even after the chunk size had been changed, any uploader of an existing file could check first to see whether it has already been uploaded with an earlier chunk size. Quite an expensive thing to check, so not necessarily worthwhile?
A way to mitigate that costly check would be for the network to maintain a map between the hash of every public file and any data maps existing for the file (one for each chunk size that has been used). I’m not sure if that will affect privacy.
It’s not immediately apparent that it creates a privacy problem, but one problem it does create is how to store that map. It adds complexity but not that much, and might have extra benefits in terms of applications.
If there were a map between the hash of every public file and the data maps allowing access to the file (with one for each chunk size uploaded), what uses might it have?
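As a rough illustration of the map described above, here is a sketch of an in-memory index keyed by file hash, with one data map address per chunk size. Everything here is hypothetical (`PublicFileIndex`, `register`, `lookup`, the address strings); how such a map would actually be stored on the network is exactly the open question.

```python
import hashlib


def file_hash(file_bytes: bytes) -> str:
    """Hash of the whole (public) file, used as the index key."""
    return hashlib.sha256(file_bytes).hexdigest()


class PublicFileIndex:
    """Hypothetical map: file hash -> {chunk size used -> data map address}."""

    def __init__(self):
        self._entries = {}

    def register(self, file_bytes: bytes, chunk_size: int, data_map_address: str) -> None:
        """Record that this file was uploaded with the given chunk size."""
        self._entries.setdefault(file_hash(file_bytes), {})[chunk_size] = data_map_address

    def lookup(self, file_bytes: bytes) -> dict:
        """Return every known data map for this file, keyed by chunk size."""
        return dict(self._entries.get(file_hash(file_bytes), {}))


index = PublicFileIndex()
photo = b"example public file contents"

# Uploaded once under the original 1MB chunk size (address is a placeholder).
index.register(photo, 1_000_000, "datamap-address-v1")

# A later uploader, after a chunk size change, checks before paying again.
existing = index.lookup(photo)
if existing:
    print("already available with chunk sizes:", sorted(existing))
```

The lookup costs one hash of the file plus one query, rather than re-chunking the file at every historical chunk size and probing for each chunk.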
It makes it easy for anyone uploading a public file after a chunk size change to check if the file is already available with an earlier chunk size.
People with access to any uploaded public file could publish a metadata index referring to the map, which is a step towards a Safe Search app. I guess that is a privacy risk, but only for those who choose to publish metadata for files they possess, knowing the risks of doing so, and the metadata can of course be published anonymously.
Any other uses?
Obviously people could create such indexes regardless, but baking this map into the network of public data might help unify those metadata indexes, as well as avoiding file atrophy and duplication on each change in chunk size.
If I self encrypt a file that is 500K then the result is 3 chunks of approx 170K.
Self encrypt will always give the same chunks for the same file input.
Obviously if you change the input file then you change the chunks.
It seems what you are really asking is to reconsider the maximum chunk size. And changing that changes dedup for files over 3MB.
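The sizes mentioned above (three chunks of roughly 170K for a 500K file, and the maximum only mattering above 3MB) can be sketched as follows. This assumes a simplified rule: at least three chunks, none larger than the maximum, with the bytes spread as evenly as possible. `chunk_sizes` is a hypothetical helper and the real self encryption library may lay out the final chunks differently.

```python
MAX_CHUNK = 1_000_000  # assumed 1MB maximum chunk size
MIN_CHUNKS = 3         # assumed minimum number of chunks per file


def chunk_sizes(file_size: int, max_chunk: int = MAX_CHUNK) -> list:
    """Return a near-equal split with at least MIN_CHUNKS chunks, none above max_chunk."""
    needed = (file_size + max_chunk - 1) // max_chunk  # ceiling division
    count = max(MIN_CHUNKS, needed)
    base, extra = divmod(file_size, count)
    # The first `extra` chunks carry one additional byte each.
    return [base + 1] * extra + [base] * (count - extra)


print(chunk_sizes(500_000))         # [166667, 166667, 166666]
print(len(chunk_sizes(3_000_000)))  # 3 chunks, each exactly at the maximum
print(len(chunk_sizes(3_000_001)))  # 4 chunks, the maximum now forces a split

# Below 3MB the split does not depend on the maximum, so dedup is unaffected
# by changing it; above 3MB a different maximum gives different chunks.
print(chunk_sizes(2_000_000, 1_000_000) == chunk_sizes(2_000_000, 4_000_000))
print(chunk_sizes(8_000_000, 1_000_000) == chunk_sizes(8_000_000, 4_000_000))
```

The last two lines print `True` and `False` respectively, which is the "changes dedup for files over 3MB" point in miniature.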
That is a good question.
My thought is that since we are basing the network on internet protocols which have packet size limits, 1MB is not so bad. For TCP with 1500 byte packets there are around 670 packets per chunk, and a larger max chunk size means a greater (though still small) error rate when downloading a chunk, which increases download time through retries.
The world in general still has limits on upload speeds. By having 1MB instead of, say, 10MB we can download a large file from 10 times as many nodes, and because those downloads run in parallel, download speeds for larger files will be faster with a 1MB max chunk size than with 10MB. I.e. people with slower upload speeds will affect the overall download less with 1MB chunks than with 10MB chunks.
You have to calculate where to store each chunk, get the price, pay for it,… When you split a file into more chunks, you get more overhead (CPU, RAM, network connections).
There is a point at which the work to look after the chunks overtakes any benefits. So the benefits of the smaller size (more nodes taking part in the parallel download, and slow nodes delaying the whole download less) will be overwhelmed by the overheads.
Yes, there may be benefits to going to 1/2 MB and I am sure there is an argument to be made to do so. But I would hesitate to say 1/4 MB or 100KB is better again, due to overheads and messaging creating a storm of messages just for uploading.
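For anyone who wants to play with the packet numbers, here is a small sketch. It takes 1MB as 1,000,000 bytes and uses the 1500 byte figure from the post as the packet size (the usable payload is a little smaller in practice). The per-packet loss rate is a made-up value for illustration only, and the model ignores that TCP and QUIC retransmit individual packets rather than whole chunks.

```python
import math

PACKET_BYTES = 1500  # packet size used in the post; an approximation


def packets_per_chunk(chunk_bytes: int, packet_bytes: int = PACKET_BYTES) -> int:
    """Number of packets needed to carry one chunk."""
    return math.ceil(chunk_bytes / packet_bytes)


def p_any_packet_lost(chunk_bytes: int, loss_rate: float) -> float:
    """Chance that at least one packet of the chunk is lost, assuming independent losses."""
    return 1 - (1 - loss_rate) ** packets_per_chunk(chunk_bytes)


ASSUMED_LOSS_RATE = 1e-4  # hypothetical per-packet loss rate, not a measurement

for size_mb in (1, 5, 10):
    size = size_mb * 1_000_000
    print(
        f"{size_mb}MB: {packets_per_chunk(size)} packets, "
        f"P(at least one loss) = {p_any_packet_lost(size, ASSUMED_LOSS_RATE):.3f}"
    )
```

A 1MB chunk comes out at 667 packets, matching the "around 670" above, and the chance of hitting at least one loss grows with chunk size whatever loss rate is plugged in.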
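A toy way to see both sides of that trade-off: count the chunks for a given file (each chunk is one more node that can serve part of the download in parallel) and the per-chunk operations an uploader has to perform. The three operations per chunk (locate, get price, pay) come from the list above; treating them as equal-cost units, and the `chunk_workload` helper itself, are assumptions for illustration.

```python
OPS_PER_CHUNK = 3  # locate storage, get price, pay (assumed equal-cost units)


def chunk_workload(file_bytes: int, chunk_bytes: int, ops_per_chunk: int = OPS_PER_CHUNK):
    """Return (number of chunks, total per-chunk operations) for one file."""
    chunks = (file_bytes + chunk_bytes - 1) // chunk_bytes  # ceiling division
    return chunks, chunks * ops_per_chunk


FILE_BYTES = 100_000_000  # a 100MB file

for label, chunk_bytes in (("100KB", 100_000), ("1MB", 1_000_000), ("10MB", 10_000_000)):
    chunks, ops = chunk_workload(FILE_BYTES, chunk_bytes)
    print(f"{label:>5} chunks: {chunks:>4} parallel sources, {ops:>4} upload operations")
```

Dropping from 1MB to 100KB gives ten times the parallel sources but also ten times the messages and payments, which is the "storm of messages" concern.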
Consider ADSL2, which has an upload speed of 1Mbits/s, and compare the fastest delivery times for 1, 5 and 10MB chunks (protocol overheads are not taken into account in this comparison). For faster upload speeds, scale accordingly (for 10Mbits/s, just divide by 10). Australia typically has 10 or 20Mbits/s upload, with some connections at 40Mbits/s and above. Starlink is 20Mbits/s.
1MB 8 seconds
5MB 40 seconds
10MB 80 seconds
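The figures above are just chunk size in bits divided by upload speed. A quick sketch for trying other speeds, taking 1MB as 1,000,000 bytes and ignoring protocol overheads as the comparison does (`transfer_seconds` is a hypothetical helper name):

```python
def transfer_seconds(chunk_mb: float, upload_mbits_per_s: float) -> float:
    """Best-case seconds to send one chunk: size in bits / link speed in bits per second."""
    bits = chunk_mb * 1_000_000 * 8
    return bits / (upload_mbits_per_s * 1_000_000)


for speed in (1, 10, 20):  # ADSL2, then the 10 and 20Mbits/s cases mentioned
    times = ", ".join(f"{mb}MB {transfer_seconds(mb, speed):g}s" for mb in (1, 5, 10))
    print(f"{speed}Mbits/s upload: {times}")
```

At 1Mbits/s this reproduces the 8, 40 and 80 seconds listed; at 10Mbits/s the same chunks take 0.8, 4 and 8 seconds.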
Remember that for file sizes less than 3MB the chunk sizes will be less than 1MB anyhow. Web pages are not typically 1MB, although the images can at times be over 1MB.
I assume the majority of people now, or in the near future, will have at least 10Mbits/s upload. That means a 1MB chunk takes under a second to deliver, and going to a smaller chunk size may not be noticeable in a typical file/chunk download, because lag and overheads become a significant part of the time taken. But with a 5MB or 10MB chunk size the time is 4 or 8 seconds, which is noticeable and could potentially make downloads longer even with parallelism.
To answer your question, now that I have thought out loud: 1MB provides a decent balance between speed and overheads. Going lower will see TCP/UDP protocol overheads become increasingly significant, not to mention SAFE messaging overheads and payment overheads when uploading chunks. Going higher will see the upload time per chunk exceed 1 second (maybe up to 10 seconds for 10MB) on the typical internet connection in many countries.
After considering the question it is my opinion that 1MB max chunk size is a good sweet spot for many years to come. Maybe when it comes time for a SAFE New Generation then I am sure speeds will favour a larger max size for the chunk if that is even the basis for storage in the NG Safe…