Proposal: Tarchive

I’ve been digging into this a bit more this week.

The chunk streamer library accepts an offset and a limit, which means downloading parts of a tarchive is also possible.

For a prototype, I’m going to try the following with AntTP.

For uploading:

  • Create a name.tar of the collection of files
  • Generate a name.tar.json metadata file (or some such) containing the offset/limit of each file.
  • Add both to a public archive and upload it.
  • Adding additional files to the tar would mean generating a new metadata file and public archive, but deduplication will help with larger archives. Smaller archives will likely fit within the minimum number of chunks anyway.
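To make the metadata step concrete, here is a minimal sketch in Python (just for illustration; `build_tar_index` is my own name, not part of AntTP). The standard `tarfile` module already exposes each member's data offset and size, which is all name.tar.json needs to record:

```python
import io
import json
import tarfile

def build_tar_index(tar_bytes: bytes) -> dict:
    """Map each file name to its data offset and length within the tar."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar.getmembers():
            if member.isfile():
                # offset_data is where the file's bytes begin inside the tar
                index[member.name] = {"offset": member.offset_data,
                                      "limit": member.size}
    return index

# Build a small tar in memory, then generate its name.tar.json companion.
# USTAR format keeps headers at a fixed 512 bytes for predictability.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for name, data in [("a.txt", b"hello"), ("b.txt", b"world!")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

tar_bytes = buf.getvalue()
meta_json = json.dumps(build_tar_index(tar_bytes), indent=2)
```

Both name.tar and meta_json would then go into the public archive together.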

For downloading:

  • The public archive containing the above is downloaded
  • If the metadata file exists and a file name is provided in the query, then the exact offset/limit is used like an HTTP range request to extract the target file directly from the tar file.
  • Adding an LRU cache for all downloaded chunks will allow the above to be cached and other chunks/files to be retrieved quickly.
  • The LRU cache will be used for other immutable data too, providing a performance boost all around.

I can then investigate the performance and see how it stands up. It should mean cheaper/faster uploads, without losing the flexibility to download each file individually.

It may mean downloading a full 4 MB chunk for a small file, but bandwidth is plentiful; latency is the real issue. If other files in that chunk are needed, they will also be cached for near-instant retrieval.
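The worst case is bounded, too. A sketch of the arithmetic (assuming 4 MB chunks, as above):

```python
CHUNK_SIZE = 4 * 1024 * 1024  # assuming 4 MB chunks

def chunks_touched(offset: int, limit: int) -> int:
    """How many chunks a single range read has to download."""
    return (offset + limit - 1) // CHUNK_SIZE - offset // CHUNK_SIZE + 1

# A 1 KB file wholly inside one chunk costs a single 4 MB download;
# only a file straddling a chunk boundary costs two.
```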

Instead of wrapping within a public archive, it may be better to create a dedicated type. However, I suspect most of the performance gains will come from the above.

I hope to have something working on Friday, but will see how much free time I get.
