Proposal: Tarchive

What?

Tarchive is a portmanteau of Tar and Archive, and that describes what it is - an archive represented as a tar file. In short, it’s an archive format where multiple files have been appended into a single file (a tarball), which has then been self-encrypted into 3+ chunks.

Why?

Having lots of small files means lots of chunks. All files are split into a minimum of 3 chunks, so when you have a collection of many files under the 4 MB chunk size, it soon multiplies up.

That means more chunks to upload (and pay for) and more chunks to download (and route).

Tar (Tape Archive) files are an age-old Unix solution to the problem of writing many files to tape as a contiguous stream of data.

Oddly, we find ourselves with a similar goal for writing data to the Autonomi network, to maximise the number of files we can squeeze into the smallest number of chunks.

How?

The tar format is standard. There are libraries, CLI commands, GUIs, etc. It’s a well known format.

Paraphrasing: a public archive on Autonomi is a container (stored in a chunk), which includes a list of files and their addresses. Each file is then chopped up and self-encrypted into a minimum of 3 chunks. So, an archive containing 10 x 100 KB files (1 MB total) would be approximately 31 chunks (TBC): 3 for each file plus 1 for the archive itself.

Using a tarchive, the files would be placed in a tarball, which is the output from tar. The 10 x 100 KB files would be appended to one another to make a single file of approximately 1 MB. This tarchive could then be fed into self-encryption, resulting in 3 chunks of approximately 333 KB each.
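
For illustration, here is a minimal sketch of the packing step, assuming the Rust `tar` crate and some hypothetical file names (the self-encryption/upload step is only hinted at in a comment, since the exact client API isn’t specified here):

```rust
// Sketch only: bundle small files into one tarball before self-encryption.
// Assumes the `tar` crate; the file names are hypothetical and the
// self-encryption/upload step is out of scope here.
use std::io::Result;

fn build_tarchive(paths: &[&str]) -> Result<Vec<u8>> {
    // Write the tarball into an in-memory buffer.
    let mut builder = tar::Builder::new(Vec::new());
    for path in paths {
        // Append each file under its on-disk path name.
        builder.append_path(path)?;
    }
    // Finish the archive (trailing zero blocks) and return the raw bytes.
    builder.into_inner()
}

fn main() -> Result<()> {
    let tarball = build_tarchive(&["style.css", "app.js", "logo.svg"])?;
    // This single buffer would then be self-encrypted into ~3 chunks and
    // uploaded as one file, instead of 3 chunks per original file.
    println!("tarball is {} bytes", tarball.len());
    Ok(())
}
```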

As we already have apps that can read these archives and resolve the files by name, it would be a relatively small change to reference files within the tarchive. The primary difference is that the tarchive would need to be downloaded (at least partially) in order to extract each file. This presents a different performance characteristic, but when many/all files in the archive are likely to be needed, the benefits would outweigh the negatives.

It would also mean that caching the tarchive would be desirable, as the benefit is in downloading as few chunks as possible, then re-using them to retrieve each file.
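
As a sketch of how an app might resolve a file by name from a locally cached tarchive (assuming the `tar` crate; the cache path and file name are hypothetical):

```rust
// Sketch only: resolve one named entry from a locally cached tarball.
// Assumes the `tar` crate; the cache path and file name are hypothetical.
use std::fs::File;
use std::io::{Read, Result};
use std::path::Path;

fn read_entry(tarball: &Path, wanted: &str) -> Result<Option<Vec<u8>>> {
    let mut archive = tar::Archive::new(File::open(tarball)?);
    for entry in archive.entries()? {
        let mut entry = entry?;
        // Compare against the path stored in the tar header.
        if &*entry.path()? == Path::new(wanted) {
            let mut buf = Vec::new();
            entry.read_to_end(&mut buf)?;
            return Ok(Some(buf));
        }
    }
    Ok(None)
}

fn main() -> Result<()> {
    if let Some(css) = read_entry(Path::new("cache/site.tar"), "style.css")? {
        println!("style.css is {} bytes", css.len());
    }
    Ok(())
}
```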

Use Cases

I would envisage apps with multiple files (e.g. css, js, assets, fonts, etc) being particularly suitable. These files would all need downloading anyway, so why not grab them all at once?

Where you lose is latency when reading one file within a large archive, as the whole tarchive would be needed.

There is likely to be less parallelism (at this time - may change in the future) in downloading a handful of large chunks vs many small chunks. However, this may also take the strain off struggling home routers.

Edit: It would also be good for backups, to get the cheapest storage for many files. Maybe zipping would help too, but I remember reading that compression is already applied during self-encryption (TBC).

I thought I’d write this down for posterity and request thoughts/ideas from folks on the forum.

16 Likes

I expect there are use cases for this, or maybe variations, so it would be good to work on.

You’ve explained the main benefits IMO.

The main downside, other than the need to either keep separate metadata or download the whole archive to know what it contains, is that any change would, I think, mean storing the whole archive again.

Compared to what we have now, where only changed files and a new set of metadata need uploading. It’s not clear to me which will be best for website and web app publishing, and it may well depend on the tooling used.

I think some use cases would benefit and others not, so good to have the option and ideally a common API so an app can just choose which way is best for its purpose.

I plan to build an rclone backend for Autonomi so ideas like this are very interesting to me, but I’ve not gone far into this yet. Great idea to explore!

8 Likes

Nice! I wonder if there could be some map structure added (in the first chunk? a separate one?) with the contents of the archive, so that knowing which file you want, you could download just the specific chunk containing that file. But that would definitely break the simplicity of Autonomi+Tar and become a datatype of its own.
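
As a rough sketch of that idea (assuming the `tar` crate; such an index is not an existing Autonomi datatype, just an illustration), a name-to-offset map could be built by scanning the tarball once:

```rust
// Sketch only: build an entry-name -> (data offset, size) index for a tarball,
// so a client could later fetch just the chunks covering a wanted byte range.
// Assumes the `tar` crate.
use std::collections::HashMap;
use std::fs::File;
use std::io::Result;

fn index_tarball(path: &str) -> Result<HashMap<String, (u64, u64)>> {
    let mut archive = tar::Archive::new(File::open(path)?);
    let mut index = HashMap::new();
    for entry in archive.entries()? {
        let entry = entry?;
        // raw_file_position() is the byte offset of the entry's data within
        // the tarball; size() is the length of that data.
        index.insert(
            entry.path()?.to_string_lossy().into_owned(),
            (entry.raw_file_position(), entry.size()),
        );
    }
    Ok(index)
}
```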

4 Likes

A problem with removing self-encryption is that it undermines the way chunks and their content are obfuscated. So in general, things should be stored as encrypted chunks or self-encrypted files, rather than as a metadata chunk or a file in a chunk.

But otherwise, yes. Many ways to store content and metadata.

2 Likes

For people who are after “random” access to files, it would seem that keeping the TAR file relatively small would also be good. Even with caching, it could be a lot of downloading to get 3 files from a TAR with 10,000 small files (e.g. photos from before 2000). If the TAR file takes up 250 chunks, then getting 3 files could mean up to 250 chunks need to be downloaded.

If the TAR file is kept under, say, 10 chunks, then those 3 files only require up to 30 chunks.

Just a consideration when small files end up being a large TAR file.

On the other hand, large files would not need to be TARed, since the savings would be minimal.

Maybe when TARing small files, have the option to limit the chunks per TAR file and then have multiple TAR files. Like a limit of 10 chunks as the default.
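
A sketch of that grouping idea, assuming 4 MB chunks and a default cap of 10 chunks (roughly 40 MB) per tarchive:

```rust
// Sketch only: greedily group files into tarchives capped at ~10 chunks
// (10 x 4 MB = 40 MB), so fetching a few files never touches a huge tarball.
const CHUNK_SIZE: u64 = 4 * 1024 * 1024;
const MAX_CHUNKS_PER_TAR: u64 = 10;

/// Takes (name, size) pairs and returns groups of names; each group is
/// intended to become one tarchive of at most ~40 MB.
fn group_files(files: &[(String, u64)]) -> Vec<Vec<String>> {
    let cap = CHUNK_SIZE * MAX_CHUNKS_PER_TAR;
    let mut groups: Vec<Vec<String>> = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut current_size = 0u64;

    for (name, size) in files {
        // Start a new group if adding this file would exceed the cap
        // (tar header overhead is ignored in this sketch).
        if !current.is_empty() && current_size + size > cap {
            groups.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current.push(name.clone());
        current_size += size;
    }
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}
```

With a 10-chunk cap, fetching any 3 files from a group touches at most 30 chunks, matching the numbers above.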

5 Likes

Yes, it is definitely a complementary archive format, as there are clear pros and cons of the different approaches.

Something like a photo album as a tarchive could be nice. Same for music albums. When there are lots of files smaller than 12 MB (3x 4 MB), it could make sense. The total size may be small enough to make a full download acceptable, while reducing chunks needed to a minimum.

Where there is a natural container (e.g. album), with the content unlikely to change, it also seems like a decent fit.

Also, any time there are lots of little files. 100s of 1-10 KB files would benefit hugely from having 3 chunks, rather than hundreds. Application assets could fall into this bucket, where a high proportion of the assets may be needed all the time too.

3 Likes

Interestingly, you could still rdiff the whole tarchive and create a delta for the changes. Depending on how smart the algorithm is, it may also cope with new files being added in the middle too, etc.

It may actually be easier in some ways. Having only 1 file and a series of deltas (for change sets) is simpler than lots of files, each with their deltas.

Calculating the potential XOR address of the original with deltas applied should be easy to do offline too, i.e. comparing the local files against the last known online version with deltas applied.

You may be able to use simpler binary diff commands/libraries too. I gather xdelta will play nicely with tarballs, for example.

It could be much easier than rdiff for a whole directory tree, etc. Not sure how delta efficiency would compare, but complexity seems much reduced.

Edit: to add, you could create a new tar, appending the delta. Maybe that would still dedupe for all but the last chunk(s) (where delta(s) is added)? :thinking:

2 Likes

And instead of using TAR, which is not efficient for accessing a specific file, would it be better to use RAR or 7Z, which are much more efficient due to their use of metadata? Would that allow downloading only the metadata table and the essential chunks?

And wouldn’t it be possible, and better, to use Scratchpad instead of Immutable Chunks for this case?

1 Like

Potentially, yes, but it depends. Tar is obviously a very simple format for stitching files together. This makes it the most comparable to the current Autonomi archive types.

Given that data is compressed during self-encryption (IIRC), double compressing wouldn’t be a benefit. It could also change every byte in the file/archive, making it hard to diff against or dedupe chunks. So, then it’s a question of whether a header index is valuable or not, and what format that should be in.

Tarballs have the advantage of being compatible with append-only behavior. This may have advantages for deduplication on modification.

Chunks vs scratchpad may be a larger discussion. The network design lends itself well to using chunks for permanent storage. Scratchpad is more for temporary data.

There’s also the option of using History/Registers to handle increments, and maybe a periodic full tar update.

1 Like

I might try prototyping just using a regular tarball uploaded as a file. Just to keep it super simple.

Checking the file is a tar file, then allowing files to be read directly from it, would be pretty handy out of the box. There are rust libs to read tars easily too, so should be straightforward.
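
For example, a minimal “is this a tar file?” check could look like the sketch below (it only looks for the POSIX/GNU `ustar` magic at byte offset 257 of the first header block, so old v7-format tars wouldn’t match):

```rust
// Sketch only: detect a POSIX/GNU tar file by its header magic. The 5 bytes
// "ustar" sit at offset 257 of the first 512-byte header block.
fn looks_like_tar(data: &[u8]) -> bool {
    data.len() >= 512 && data[257..262] == *b"ustar"
}
```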

Rust libs:

2 Likes

The more I think about what kind of applications could succeed in Autonomi, the less I believe that the use of self-encryption will be the most common.

The entire historical foundation of Safe-Autonomi was based on the idea that data storage would be very cheap, so thinking that almost all information would be immutable chunks could be correct.
That breaks down the moment it shifts to blockchain, and the cost of storage, mainly due to fees, starts to become significant.

This will make the optimized use of different data types and minimizing the number of transactions extremely important. Therefore, the applications that will succeed will be those that minimize storage as much as possible.

It also significantly increases the value of Scratchpad, which, when used correctly, can store data much more economically than Chunks for data where eternal immutability is not necessary.

Likewise, I wonder why someone would use self-encryption, with the complication of having to use AES encryption and Brotli compression, to store certain public data, like videos, where there is no size benefit and when a simple split/cat and a list of XOR addresses would suffice.

We’re only scratching the surface, but I think we might find that the way Autonomi is used will be very different from what we expected.

3 Likes

Unencrypted data is also a danger to those storing it. Having plausible deniability for what is on your node is important. Folks may also accidentally leak all sorts of private data that they didn’t mean to.

IIRC, scratchpads are only 1 MB and you still need to pay to create them. For small tarchives, that may be a benefit, but for those exceeding 1 MB, will it become more expensive?

Moreover, the mutability of scratchpads makes them harder to cache. As their content isn’t used as an address, it’s harder to protect against tampering too. Maybe TTLs could be set to define caching behavior, but that may only be useful for end clients, unless this is coded into the datatype/network layer.

Again, you could use etags to avoid cache expiry, but that is still an overhead.

Without extensive caching, popular data will slow to a crawl. Immutable data is a big part of the solution to this.

Also, there is no deduplication with scratchpads. With chunks, if most of them are unchanged, they are ‘free’ too, in the context of uploading again. Append-only data structures could benefit from this, cost wise.

Ultimately, these are reasons for a native currency, as once we have it, the benefits of free scratchpad updates diminish. I also wouldn’t be surprised if free changes to cheap, to avoid this sort of chicanery around fees.

So, I hear your point and the economics will certainly drive things. However, those economics are certain to change with the introduction of a native currency and perhaps moves to prevent scratchpad abuse.

3 Likes

Many people, for example, trying to replace torrent with Autonomi, don’t care about the nodes. All they want is to use the simplest and cheapest option. In these cases, self-encryption is a hindrance.

Scratchpads are 4 MB - the same size as a chunk, with many more advantages. The simple fact that you can reuse them at no cost, and without needing to interact with the slow Arbitrum network, would enable developments that are unthinkable to achieve with chunks.

I hope so, but I fear we won’t see a native token in Autonomi for a long time, if we ever do.

1 Like

If it means folks will be afraid to run nodes, that’s not good. That’s my primary worry, although stupid people uploading private stuff could be a problem too. It’s also counter to what people have been told.

Ah, my mistake. Thanks.

Are they the same price to create as immutable chunks?

The argument is pretty clear that we need it. Much of what you have said reinforces that too.

The team have said repeatedly that it is high on the priority list, once the network is stable.

We also have community devs looking at creating one to help move it forward quicker.

This suggests to me that it’s when we go native, not if.

1 Like

:confetti_ball:

1 Like

All of this feels highly relevant to what I’m working on myself. I’m planning to build a knowledge hub—maybe even a forum, if that’s technically feasible—on the Autonomi network, focused on underground chemistry. Think of it as a continuation of those sites from the ’90s that fed my curiosity and empowered a kind of citizen science, the ripple effects of which can be seen today in the mainstreaming of psychedelic medical research. The Hive, Lycaeum, Rhodium archives…

As a starting point, I plan to archive active communities like www.thevespiary.org and www.sciencemadness.org into the Autonomi network. Not to replace them, but to preserve and mirror them in a space where censorship, takedowns, and central points of failure are no longer a threat.

I truly believe that the communities behind these forums would benefit massively from what Autonomi offers. In fact, I’d bet the psychedelic scene in particular is full of people who could easily become early adopters—people who already understand the value of privacy, decentralization, and publishing without permission.

And it’s not a one-way street. The presence of psychonauts, underground chemists, and radical learners on Autonomi would also give this network more gravity. These are people with strong cultural momentum, real-world motivation, and a proven track record of building resilient, decentralized knowledge networks long before “Web3” was a buzzword.

I want to emphasize how important it is that data on Autonomi is immutable. Once it’s published, no amount of pressure or threats can make it disappear. That kind of permanence is exactly what sensitive, non-mainstream knowledge needs to survive long-term.

I also get that this kind of project may be controversial, even in a community like this. So let me clarify: sites like The Vespiary and ScienceMadness are, to my knowledge, perfectly legal and operate openly on the clearnet—not hidden away in some shady corner of Tor. This isn’t about promoting illegal activity—it’s about preserving open access to knowledge that has real cultural and scientific value.

5 Likes

Don’t expect that to continue. Once costing is based on size and record type (already in the quote metrics), they can put a cost on updating a scratchpad record.

1 Like

And then, what do I rely on, what currently exists or what may exist in the future?

This is changing the rules of the game in the middle of the match.

It’s been changing for years. Scratchpad did not even exist 12 months ago. And there is no reason to believe that you can keep writing data for free.

The only free thing is transactions for the native coin. But it’s even been suggested that those might not be entirely free, just always extremely small.

It’s not changing the rules of the game but evolving into the network as it’s meant to be. Free data is not what should be happening. And paying once for 4 MB of space, then being able to write PBs’ worth of data over the years, seems to be against that.

5 Likes