I realized after this post that we really need a solution for compressing collections of small files. Even when going to a native token, it makes no sense to store mostly empty space on the network. Here are my high-level implementation thoughts on this problem:
The problem
One way to think about the Autonomi network is as a giant hard drive with 12MB sectors (three 4MB chunks for self-encrypted data). A sector on a hard drive is the smallest unit of space you can write to disk, so if you have a file that is 100 bytes and the sector is 512 bytes, 412 bytes of space are wasted. In my case, I'm averaging 500KB per file, so we're wasting 90%-plus of each sector. This is unacceptable.
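To put a number on that waste, here's a back-of-the-envelope calculation using the figures above (`wasted_fraction` is just an illustrative helper, not part of any existing API):

```rust
// Fraction of a sector wasted when it holds a single small file.
fn wasted_fraction(sector_bytes: u64, file_bytes: u64) -> f64 {
    1.0 - (file_bytes as f64) / (sector_bytes as f64)
}

fn main() {
    let sector = 12 * 1024 * 1024; // 12MB minimum (three 4MB chunks)
    let file = 500 * 1024;         // ~500KB average file
    // prints "wasted: 95.9%"
    println!("wasted: {:.1}%", wasted_fraction(sector, file) * 100.0);
}
```

So one average file per sector wastes close to 96% of the space, consistent with the "90% plus" figure.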
What the user will see
In colonylib and all downstream tools that leverage it, I use an IRI like this to address files in metadata and for downloading:
ant://b5c038a434f2617a659a73ac9b4fd36874538e8f622bb10aa25fc37c2964960f
I want to slice this address into multiple fragments and index them by an integer. So for the previous example, say it contains 20 files and I want the 5th file; I would simply change the address for this content to:
ant://b5c038a434f2617a659a73ac9b4fd36874538e8f622bb10aa25fc37c2964960f/5
The "/5" would be all that is required to indicate this uses a small file compression scheme and provide the necessary input to extract that particular file from the data.
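A sketch of how a client could split that IRI form into its base address and optional file index (`parse_ant_iri` is a hypothetical name, not an existing colonylib function):

```rust
// Parse "ant://<address>" or "ant://<address>/<index>".
// Returns the base address plus Some(index) when a fragment index is present;
// a plain address (today's behavior) comes back with index None.
fn parse_ant_iri(iri: &str) -> Option<(String, Option<u32>)> {
    let rest = iri.strip_prefix("ant://")?;
    match rest.split_once('/') {
        Some((addr, idx)) => {
            let idx: u32 = idx.parse().ok()?; // non-numeric suffix -> invalid
            Some((addr.to_string(), Some(idx)))
        }
        None => Some((rest.to_string(), None)),
    }
}
```

This keeps plain addresses working unchanged, with the trailing integer as the only signal that the small-file scheme is in play.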
Mapping the data
Since the files are so small compared to the 12MB minimum size, we can treat this as a small simple file system mapped as a contiguous stream of bytes. The header would be a CSV string mapping out the files contained within:
1,200,5278
2,5478,10000
3,15478,9787
The columns being:
- the index of the file to reference from the IRI string (starting with 1, though I'm not opposed to 0 being the first one if someone has a strong opinion)
- the byte offset of the first byte of this small file within the 12MB sector, counted from the last character of the header
- the size in bytes of the small file
Now we could get fancy here and use JSON or something instead of CSV; it doesn't matter as long as it is consistent. Maybe even have a string at the top, like a "#!" type line, that denotes the header format. The idea is that we have some kind of index key (whether integer or string), the start of the data, the size or end of the data, and an indicator character to tell the extraction tool "this is the end of the header section".
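Assuming the CSV variant above, parsing the header and slicing out one file could look roughly like this (`Entry`, `parse_header`, and `extract` are hypothetical names, and the end-of-header indicator is left out since it's still undecided):

```rust
/// One header row: file index, byte offset (from the end of the header), size.
struct Entry {
    index: u32,
    offset: usize,
    size: usize,
}

// Parse "index,offset,size" CSV rows into entries, skipping blank lines.
fn parse_header(header: &str) -> Vec<Entry> {
    header
        .lines()
        .filter(|l| !l.trim().is_empty())
        .filter_map(|l| {
            let mut cols = l.split(',');
            Some(Entry {
                index: cols.next()?.trim().parse().ok()?,
                offset: cols.next()?.trim().parse().ok()?,
                size: cols.next()?.trim().parse().ok()?,
            })
        })
        .collect()
}

// Slice the requested file out of the data region that follows the header.
// Returns None if the index is unknown or the range runs past the data.
fn extract<'a>(entries: &[Entry], data: &'a [u8], index: u32) -> Option<&'a [u8]> {
    let e = entries.iter().find(|e| e.index == index)?;
    data.get(e.offset..e.offset + e.size)
}
```

Extraction really is just a table lookup plus a slice, which is why the download side should be the easy half of this.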
The only other rule would be that every small file must fit within the 12MB data space (minus the header) in its entirety; no wrapping across 12MB blocks. This isn't a technical problem, it could be done, but it seems to add a lot more complexity than is necessary for most data collections.
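Under that no-wrap rule, curation reduces to a bin-packing pass over the file sizes. A simple first-fit sketch (`pack` is a hypothetical helper, and the per-sector header bytes are ignored here for brevity):

```rust
// First-fit packing: place each file in the first sector with enough free
// space, opening a new sector when none fits. Each file must fit entirely
// within one sector's data capacity (the no-wrap rule).
// Returns, per sector, the indices of the files placed in it.
fn pack(file_sizes: &[usize], capacity: usize) -> Vec<Vec<usize>> {
    let mut sectors: Vec<(usize, Vec<usize>)> = Vec::new(); // (free bytes, file indices)
    for (i, &size) in file_sizes.iter().enumerate() {
        assert!(size <= capacity, "file larger than one sector's data region");
        match sectors.iter_mut().find(|(free, _)| *free >= size) {
            Some((free, files)) => {
                *free -= size;
                files.push(i);
            }
            None => sectors.push((capacity - size, vec![i])),
        }
    }
    sectors.into_iter().map(|(_, files)| files).collect()
}
```

First-fit isn't optimal, but with files averaging ~500KB against a 12MB sector it should keep each sector nearly full without any clever ordering.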
I'll build this into colonylib so that when you upload/download public immutable data with these tools, it "just works". Extracting the data will be very easy. Curating it is a little trickier, but I think to the user it will be a similar interface to uploading a directory, the difference being that it breaks the data up into these 12MB sectors (for lack of a better term) instead of a single archive. @happybeing and @traktion, maybe this is something of interest for dweb and anttp?
Thatās the idea anyway. What do you guys think?