Proposal - Tarchive
What?
Tarchive is a portmanteau of Tar and Archive, and that describes what it is: an archive represented as a tar file. In short, it’s an archive format where multiple files are appended into a single file (a tarball), which is then self-encrypted into 3+ chunks.
Why?
Having lots of small files means lots of chunks. Every file is split into a minimum of 3 chunks, so when you have a collection of many files under the 4 MB chunk size, the chunk count soon multiplies up.
That means more chunks to upload (and pay for) and more chunks to download (and route).
Tar (Tape Archive) files are an age-old Unix solution to the problem of writing many files to tape as a contiguous stream of data.
Oddly, we find ourselves with a similar goal when writing data to the Autonomi network: to squeeze the maximum number of files into the smallest number of chunks.
How?
The tar format is standard and well known. There are libraries, CLI commands, GUIs, etc.
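For illustration, here's a minimal sketch of building a tarball in memory with the Rust tar crate (the file names and contents below are made up):

```rust
use tar::{Builder, Header};

/// Append a set of (name, bytes) pairs into a single in-memory tarball.
fn build_tarball(files: &[(&str, &[u8])]) -> std::io::Result<Vec<u8>> {
    let mut builder = Builder::new(Vec::new());
    for (name, data) in files {
        let mut header = Header::new_gnu();
        header.set_size(data.len() as u64);
        header.set_mode(0o644);
        header.set_cksum();
        builder.append_data(&mut header, name, *data)?;
    }
    builder.into_inner() // finishes the archive and returns the Vec<u8>
}

fn main() -> std::io::Result<()> {
    // Hypothetical web-app assets, appended into one contiguous stream.
    let tarball = build_tarball(&[
        ("index.html", b"<html>...</html>".as_slice()),
        ("style.css", b"body { margin: 0 }".as_slice()),
    ])?;
    println!("tarball is {} bytes", tarball.len());
    Ok(())
}
```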
Paraphrasing: a public archive on Autonomi is a container (stored in a chunk), which includes a list of files and their addresses. Each file is then chopped up and self-encrypted into a minimum of 3 chunks. So an archive containing 10 x 100 KB files (1 MB total) would be approximately 31 chunks (TBC): 3 for each file and 1 for the archive itself.
Using a tarchive, the files would be placed in a tarball, which is the output from tar. The 10 x 100 KB files would be appended to one another to make a single file of approximately 1 MB. This tarchive could then be fed into self-encryption, resulting in 3 chunks of approximately 333 KB each.
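To make the arithmetic concrete, a back-of-the-envelope comparison (assuming the 3-chunk minimum per self-encrypted blob, and ignoring tar header and data map overheads):

```rust
fn main() {
    let n_files = 10;
    let min_chunks = 3; // self-encryption minimum per blob

    // Plain public archive: each file self-encrypted separately,
    // plus one chunk for the archive container itself.
    let plain_archive = n_files * min_chunks + 1;

    // Tarchive: one ~1 MB tarball, self-encrypted as a single blob.
    let tarchive = min_chunks;

    println!("plain archive: ~{plain_archive} chunks"); // ~31
    println!("tarchive:      ~{tarchive} chunks");      // ~3
}
```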
As we already have apps that can read these archives and resolve files by name, it would be a relatively small change to reference files within a tarchive. The primary difference is that the tarchive would need to be downloaded (at least partially) in order to extract each file. This gives a different performance characteristic, but when many or all files in the archive are likely to be needed, the benefits should outweigh the negatives.
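As a sketch of the lookup side, assuming the tarchive bytes have already been downloaded and decrypted, finding one file by name with the tar crate could look something like:

```rust
use std::io::Read;
use tar::Archive;

/// Scan a tarball held in memory and return the contents of the named file.
fn extract_file(tarball: &[u8], wanted: &str) -> std::io::Result<Option<Vec<u8>>> {
    let mut archive = Archive::new(tarball);
    for entry in archive.entries()? {
        let mut entry = entry?;
        if entry.path()?.to_string_lossy() == wanted {
            let mut contents = Vec::new();
            entry.read_to_end(&mut contents)?;
            return Ok(Some(contents));
        }
    }
    Ok(None) // file not present in the tarchive
}
```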
It would also mean that caching the tarchive would be desirable, as the benefit comes from downloading as few chunks as possible, then re-using them to retrieve each file.
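A minimal sketch of that caching idea, with a hypothetical fetch_tarchive standing in for the actual chunk download and decryption:

```rust
use std::collections::HashMap;

/// Hypothetical cache keyed by the tarchive's network address, so its
/// chunks are fetched once and re-used for every file extraction.
struct TarchiveCache {
    cache: HashMap<String, Vec<u8>>,
}

impl TarchiveCache {
    fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    fn get_or_fetch(&mut self, address: &str) -> &Vec<u8> {
        self.cache
            .entry(address.to_string())
            .or_insert_with(|| fetch_tarchive(address))
    }
}

/// Placeholder for downloading and decrypting the tarchive's chunks.
fn fetch_tarchive(_address: &str) -> Vec<u8> {
    unimplemented!("download chunks, decrypt via the data map, reassemble")
}
```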
Use Cases
I would envisage apps with multiple files (CSS, JS, fonts and other assets) being particularly suitable. These files would all need downloading anyway, so why not grab them all at once?
Where you lose is latency when reading a single file within a large archive, as the whole tarchive (or at least part of it) would need downloading first.
There is likely to be less parallelism (at this time; this may change in the future) in downloading a handful of large chunks vs many small chunks. However, this may also take the strain off struggling home routers.
Edit: It would also be good for backups, to get the cheapest storage for many files. Maybe zipping would help too, but I remember reading that compression is already applied during self-encryption (TBC).
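If extra compression did turn out to be worthwhile, tar composes naturally with gzip; a sketch using the Rust tar and flate2 crates (possibly redundant if self-encryption already compresses):

```rust
use flate2::write::GzEncoder;
use flate2::Compression;
use tar::Builder;

/// Build a gzip-compressed tarball (.tar.gz) from files on disk.
fn build_tar_gz(paths: &[&str]) -> std::io::Result<Vec<u8>> {
    let encoder = GzEncoder::new(Vec::new(), Compression::default());
    let mut builder = Builder::new(encoder);
    for path in paths {
        builder.append_path(path)?; // reads the file and appends it
    }
    builder.into_inner()?.finish() // finish the tar, then the gzip stream
}
```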
I thought I’d write this down for posterity and to request thoughts/ideas from folks on the forum.