Proposal for more efficiently handling small files

I realized after this post that we really need a solution for compressing collections of small files. Even when we go to a native token, it makes no sense to store mostly empty space on the network. Here are my high-level implementation thoughts on this problem:

The problem

One way to think about the Autonomi network is as a giant hard drive with 12MB sectors (three 4MB chunks of self-encrypted data). A sector on a hard drive is the smallest unit of space you can write to disk, so if you have a file that is 100 bytes and the sector is 512 bytes, 412 bytes of space are wasted. In my case, I’m averaging 500KB per file, so we’re wasting over 90% of the space. This is unacceptable.

What the user will see

In colonylib and all downstream tools that leverage it, I use an IRI like this to address files in metadata and for downloading:

ant://b5c038a434f2617a659a73ac9b4fd36874538e8f622bb10aa25fc37c2964960f

I want to slice this address into multiple fragments and index them by an integer. So for the previous example, say it contains 20 files and I want the 5th file, I would simply change the address for this content to:

ant://b5c038a434f2617a659a73ac9b4fd36874538e8f622bb10aa25fc37c2964960f/5

The ‘/5’ would be all that is required to indicate that this uses the small file compression scheme and to provide the necessary input to extract that particular file from the data.

Mapping the data

Since the files are so small compared to the 12MB minimum size, we can treat this as a small simple file system mapped as a contiguous stream of bytes. The header would be a CSV string mapping out the files contained within:

1,200,5278
2,5478,10000
3,15478,9787

The columns being:

  • the index of the file to reference from the IRI string (starting with 1, though I’m not opposed to 0 being the first one if someone has a strong opinion)
  • the byte position within the 12MB sector, measured from the last character of the header, of the first byte of this small file
  • the size in bytes of the small file

Now we could get fancy here and do JSON or something instead of CSV; it doesn’t matter as long as it is consistent. Maybe even have a string at the top, like a ‘#!’ line, that denotes the header format. The idea is we have some kind of index key (whether integer or string), the start of the data, the size or end of the data, and an indicator character to tell the extraction tool ‘this is the end of the header section’.

The only other rule would be that all small files must fit within the 12MB data space (minus the header) in their entirety, no wrapping across 12MB blocks. This isn’t a technical problem, it could be done, but it seems to add a lot more complexity than is necessary for most data collections.
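For what it’s worth, the extraction side could be as small as the sketch below. It assumes, purely for illustration, that the CSV header is terminated by a blank line and that the byte positions count from the first byte after that terminator; the real sentinel and offset convention are still open.

def extract_small_file(sector: bytes, index: int) -> bytes:
    """Pull one small file out of a packed 12MB sector (header + data).
    Assumes the CSV header described above (index,start,size per line),
    a blank line as the end-of-header sentinel, and offsets measured from
    the first byte after that sentinel; all of these are placeholder choices."""
    header, _, data = sector.partition(b"\n\n")
    for line in header.decode("utf-8").splitlines():
        idx, start, size = (int(field) for field in line.split(","))
        if idx == index:
            return data[start:start + size]
    raise KeyError(f"no file with index {index} in this sector")

A call like extract_small_file(sector, 5) would then map directly onto the ‘/5’ suffix above.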

I’ll build this into colonylib so when you upload/download public immutable data with these tools, it ‘just works’. Extracting the data will be very easy. Curating it is a little trickier, but I think to the user it will be a similar interface to uploading a directory, the difference being it breaks up the data into these 12MB sectors (for lack of a better term) instead of a single archive. @happybeing and @traktion, maybe this is something of interest for dweb and anttp?

That’s the idea anyway. What do you guys think?

9 Likes

No, you just get smaller chunks. 4MB is maximum chunk size, not minimum. They can be very small. It’s just that the pricing is per chunk, not per MB.

3 Likes

Interesting thought - tape drive storage style on network :upside_down_face:

How do the tape drives do it?

2 Likes

AntTP already supports range requests (for streaming), if we had an index somewhere:

1,200,5278
2,5478,10000
3,15478,9787

You could simply make a range request GET for 5478 to 10000 to get the second file (correct me if I’m wrong here, but I’m pretty sure I’m not @Traktion).

So maybe just utilise standard range requests instead of special encoding via index? (Because if you encode via an index, the backend needs to fetch the corresponding index, look up the range, and basically do a range request itself to give you back the correct data.)

You could test the method with range requests (and whether it really works reliably and well) pretty simply by appending some small files to one largish binary blob and then checking whether a range request like

curl -r 0-999 ant://myautonomilibraryfile -o compressedOutputFile.tar

gives you back the original files, @zettawatt.

2 Likes

The only reason I can see for file numbers to start with zero is consistency: byte numbers start from zero, and size starts from zero because a null-length file is actually a valid file, so file numbers would start at zero as well. Maybe the zero file could be the index itself, with a start byte of 0 and a size equal to the size of the index.

This is certainly a good idea. My question now is how a normal app accesses the file using the standard Autonomi libraries. Does it have to implement the decoding in its own logic? Rhetorical question, I know.

Are you suggesting that this becomes a part of the standard Autonomi libraries?
If you are, then maybe the best place for the index (directory) is in the data map or archive record rather than in the chunks themselves.

I suggest this because, if the index is in the chunks themselves, it could become a (slightly) weak point in the encryption, since it is a predictable set of values for anyone trying to crack chunks that contain it. Whereas in the datamap it really doesn’t change things in that respect, since that already has that sort of pattern anyhow. Obviously for public files this is less of an issue, but hey, a lot of private files could utilise this too.

5 Likes

:clap:

Uuuuh - nice - range request as part of the basic libs - why not


Edit:

(start, size) vs (start, end)?

I’d vote for (start, end) because that would enable range requests without calculation (e.g. in bash scripts).

5 Likes

okay - did a quick test …

index.txt

ant_comms_groups.png 0 1383935
celebrate.png 1383936 3502079
enchanted_cottage.png 3502080 6398463
anttape.png 6398464 9840639

since the public AntTP instances seem to be down, here are commands for local usage:

ant_comms_groups.png:

curl -s -r 0-1383935 http://localhost:18888/f6e57c0c182e9899c9332b2946063b76203b620a56751c4fb4683a23b0f561ba/images.tar | tar -x

celebrate.png:

curl -s -r 1383936-3502079 http://localhost:18888/f6e57c0c182e9899c9332b2946063b76203b620a56751c4fb4683a23b0f561ba/images.tar | tar -x

enchanted_cottage.png:

curl -s -r 3502080-6398463 http://localhost:18888/f6e57c0c182e9899c9332b2946063b76203b620a56751c4fb4683a23b0f561ba/images.tar | tar -x

anttape.png:

curl -s -r 6398464-9840639 http://localhost:18888/f6e57c0c182e9899c9332b2946063b76203b620a56751c4fb4683a23b0f561ba/images.tar | tar -x

possibly not ideal because the filename is baked into the tar archive (which isn’t needed … but maybe it isn’t too bad either … renaming on download is possible through:

curl -s -r 6398464-9840639 http://localhost:18888/f6e57c0c182e9899c9332b2946063b76203b620a56751c4fb4683a23b0f561ba/images.tar | tar -xO > new_name_for_anttape.png

)

5 Likes

Not to be too pedantic but tape drives just store 0s and 1s according to what the software being used has sent to them. :slight_smile:

‘tar’, for example, writes records of (by default) 20 512-byte blocks, with a 512-byte header record per file. There is no concept of the overall length of the archive, just a marker at the end: at least 2 zero-filled records.

I had to refresh my memory of all this from the Wikipedia article on tar, specifically these sections:

–

Rationale

Many historic tape drives read and write variable-length data blocks, leaving significant wasted space on the tape between blocks (for the tape to physically start and stop moving). Some tape drives (and raw disks) support only fixed-length data blocks. Also, when writing to any medium such as a file system or network, it takes less time to write one large block than many small blocks. Therefore, the tar command writes data in records of many 512 B blocks. The user can specify a blocking factor, which is the number of blocks per record. The default is 20, producing 10 KiB records.[13]

File format

There are multiple tar file formats, including historical and current ones. Two tar formats are codified in POSIX: ustar and pax. Not codified but still in current use is the GNU tar format.

A tar archive consists of a series of file objects, hence the popular term tarball, referencing how a tarball collects objects of all kinds that stick to its surface. Each file object includes any file data, and is preceded by a 512-byte header record. The file data is written unaltered except that its length is rounded up to a multiple of 512 bytes. The original tar implementation did not care about the contents of the padding bytes, and left the buffer data unaltered, but most modern tar implementations fill the extra space with zeros.[14] The end of an archive is marked by at least two consecutive zero-filled records. (The origin of tar’s record size appears to be the 512-byte disk sectors used in the Version 7 Unix file system.) The final block of an archive is padded out to full length with zeros.

Header

The file header record contains metadata about a file. To ensure portability across different architectures with different byte orderings, the information in the header record is encoded in ASCII. Thus if all the files in an archive are ASCII text files, and have ASCII names, then the archive is essentially an ASCII text file (containing many NUL characters).

The fields defined by the original Unix tar format are listed in the table below. The link indicator/file type table includes some modern extensions. When a field is unused it is filled with NUL bytes. The header uses 257 bytes, then is padded with NUL bytes to make it fill a 512 byte record. There is no ā€œmagic numberā€ in the header, for file identification.

Pre-POSIX.1-1988 (i.e. v7) tar header:

Field offset   Field size   Field
0              100          File path and name
100            8            File mode (octal)
108            8            Owner’s numeric user ID (octal)
116            8            Group’s numeric user ID (octal)
124            12           File size in bytes (octal)
136            12           Last modification time in numeric Unix time format (octal)
148            8            Checksum for header record
156            1            Link indicator (file type)
157            100          Name of linked file
–

If you want to find out what is on the tape you have to stream the whole length to build a list of the archives on the tape.
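The same single pass works for a tar blob on this network: stream it once, record each member’s offsets, and from then on only the needed ranges have to be fetched. A sketch using Python’s tarfile (the name/start/end output mirrors the index.txt format used earlier in the thread; the padding arithmetic is my assumption about what the range should include so it can be piped straight into tar -x):

import tarfile

def build_tar_index(tar_path: str, index_path: str) -> None:
    """One pass over the archive, recording where each member sits.
    start includes the 512-byte tar header and end includes the zero
    padding, so the resulting range can be piped straight into tar -x."""
    with tarfile.open(tar_path, "r") as tar, open(index_path, "w") as out:
        for member in tar:
            if not member.isfile():
                continue
            start = member.offset                          # header record start
            padded = ((member.size + 511) // 512) * 512    # data rounded up to full 512B blocks
            end = member.offset_data + padded - 1          # last byte of the padded data
            out.write(f"{member.name} {start} {end}\n")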

Other tape storage formats are available!

LTFS for example which includes an index and metadata partition.

To @zettawatt’s point, I don’t think it’s a problem that a small object will ‘waste’ a large percentage of the minimum size of an object. The space will just not be used on the network or on the node’s storage. The only thing being wasted is a large percentage of an address in a very large address space. What is unfortunate is paying the full attos for a chunk no matter how little of its 4MB you use, and the very unfortunate ARB ETH fees to do so.

(I still think the unit size of a chunk being 4MB is too large.)

Ranged reads for chunks would be very useful. They are very useful in object storage in general, especially when object sizes are 64MB or something like that, which is what suits object storage systems. In the science vertical, object sizes are large because an individual data item is large, e.g. tens of GB for a single human genome, or hundreds depending on the sampling type. But you often want to read just a few bytes of that genome to get at specific genes, and for lots of genomes at once, so ranged reads save a lot of data transfer.

So I can totally see why @riddim is interested in them.

3 Likes

okay - maybe not exactly like tapes (and especially not as slow to get to the relevant location) :smiley: buuuuut :smiley:

if you run a local instance of anttp on your linux/Mac machine:

curl -s -r 6398464-9840639 http://localhost:18888/f6e57c0c182e9899c9332b2946063b76203b620a56751c4fb4683a23b0f561ba/images.tar | tar -xO > anttape.png

will get you the relevant part of a blob of 4 pics, extract it and you’ll have a perfectly readable png in your folder =)

(now only dweb needs to come with range requests too @happybeing … and maybe utilize the standard api provided by maidsafe xD @rusty.spork - I’m pretty confident @traktion would help with implementing it)


ps: upload script used here:

import os
import tarfile
from typing import List, Tuple

base_path: str = os.path.expanduser("~/Downloads")
files: List[str] = [
    "ant_comms_groups.png",
    "celebrate.png",
    "enchanted_cottage.png",
    "anttape.png"
]

archive_path: str = os.path.join(base_path, "images.tar")
index_path: str = os.path.join(base_path, "index.txt")

byte_ranges: List[Tuple[str, int, int]] = []

# Create archive and track byte offsets
with open(archive_path, "wb") as archive_file:
    tar = tarfile.open(fileobj=archive_file, mode="w")
    for file_name in files:
        file_path = os.path.join(base_path, file_name)
        start = archive_file.tell()
        tar.add(file_path, arcname=file_name)
        tar.fileobj.flush()
        end = archive_file.tell()
        byte_ranges.append((file_name, start, end - 1))
    tar.close()

# Write index
with open(index_path, "w") as index_file:
    for name, start, end in byte_ranges:
        index_file.write(f"{name} {start} {end}\n")

possibly compressing via tar is a bit too much … but then again … it’s a compressed write strategy - right?

oooooh - I see - I don’t compress - just pack with tar xD

6 Likes

I see it as a low priority, and I don’t really see it as wasted space as such, but it would be an easy and universal way to access files.

It could be in the Autonomi APIs, with an option for what threshold to use for packing etc., but that seems overcomplicated; it’s more useful for the network just to store files. So I guess it should be a library on top, which individual apps can use if they want to compact things.

But it complicates as much as it helps. Compaction is not the priority IMO.

2 Likes

I (and probably @neo too) just meant range requests being exposed through the API, not automatic file compaction done by the API … compaction prevents deduplication and would be more of a method to enable more cost-efficient DB uploads.

2 Likes

I’m replying to the OP, haven’t read through.

1 Like

blob viewer (no tar packing but just binary blob usage + range request on anttp)

blobViewer.html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Blob Range Viewer</title>
  <style>
    body { font-family: sans-serif; padding: 1em; }
    a { display: block; margin: 0.5em 0; }
    img { margin-top: 1em; max-width: 100%; border: 1px solid #ccc; }
  </style>
</head>
<body>
  <h1>Blob Range Viewer</h1>
  <p>Click a file to view it:</p>

  <div id="file-list"></div>

  <h2>Displayed image:</h2>
  <img id="preview" alt="Image preview will appear here">

  <script>
    const baseUrl = "http://localhost:18888/ef25ac24affee21b12ae8ccfc7ab3fb8bc8981afa0af1eec2386cb5191f84d5d/blob.bin";

    const files = [
      ["ant_comms_groups.png", 0, 1382357],
      ["celebrate.png", 1382358, 3498642],
      ["enchanted_cottage.png", 3498643, 6393099],
      ["anttape.png", 6393100, 9833447]
    ];

    const fileList = document.getElementById("file-list");
    const preview = document.getElementById("preview");

    files.forEach(([name, start, end]) => {
      const link = document.createElement("a");
      link.href = "#";
      link.textContent = `${name} (${start}-${end})`;
      link.addEventListener("click", async (e) => {
        e.preventDefault();
        try {
          const response = await fetch(baseUrl, {
            headers: {
              "Range": `bytes=${start}-${end}`
            }
          });
          if (!response.ok && response.status !== 206) {
            alert("Failed to load file.");
            return;
          }
          const blob = await response.blob();
          preview.src = URL.createObjectURL(blob);
          preview.alt = name;
        } catch (err) {
          console.error(err);
          alert("Download error.");
        }
      });
      fileList.appendChild(link);
    });
  </script>
</body>
</html>

uploader python code
from typing import List, Tuple
import os

base_path: str = os.path.expanduser("~/Downloads")
filenames: List[str] = [
    "ant_comms_groups.png",
    "celebrate.png",
    "enchanted_cottage.png",
    "anttape.png"
]

blob_path: str = os.path.join(base_path, "blob.bin")
index_path: str = os.path.join(base_path, "index.txt")

file_offsets: List[Tuple[str, int, int]] = []

with open(blob_path, "wb") as blob_file:
    for filename in filenames:
        file_path = os.path.join(base_path, filename)
        start_pos = blob_file.tell()
        with open(file_path, "rb") as input_file:
            blob_file.write(input_file.read())
        end_pos = blob_file.tell() - 1
        file_offsets.append((filename, start_pos, end_pos))

with open(index_path, "w") as index_file:
    for name, start, end in file_offsets:
        index_file.write(f"{name} {start} {end}\n")

ps: anttp caching may make it look faster in this screen capture than it is on first loading attempt :smiley:

6 Likes

I’ll respond a bit to this post and a bit to the Project Gutenberg question of compacting files.

As already mentioned, the main issue is cost / GiB, not size per se. And maybe it will only be an issue while using ETH. But that’ll probably be for some time. Anyway..

Cost effectiveness

Self encryption in itself is alright for cost / GiB when:

  • file size is as close as possible (less than or equal) to 12 MiB, or larger than 12 MiB and a whole multiple of 4 MiB, i.e. such that the final chunk is as large as possible (I may have missed something here, please correct me if so)

  • data is not likely to change

The larger the file, the less noticeable the inefficiencies caused by pricing per chunk instead of by size.

Packing

There are different reasons for packing:

  • minimize one time storage cost / [size-unit]
  • minimize cost per changed [size-unit]

If you know that your avg file size is quite low, and that there is a large amount of data to add, then packing makes sense. Even more so if you know that your data is likely to change, but you do not want to keep it in mutable containers.

Who needs packing

Users who don’t upload large amounts of data won’t notice much difference. Users who upload mostly large files won’t notice a difference.

Two groups stand out (and they overlap):

  • users who upload large amounts of small files
  • users who frequently change portions of large amounts of data

I don’t think the technical side, in terms of user friendliness, is a hindrance. The first ideas may be less intuitive, ergonomic etc., and not included in the client released with the network, but eventually it will be just as accessible as anything else. Self encryption is a sort of packing protocol itself, and isn’t exactly a simple concept, but behind a simple API that isn’t a hindrance to usage.

Packing types

I’ll briefly describe what Ryyn does; maybe you can get some ideas from it @zettawatt.

Ryyn packing is based on FastCDC (rust impl), configured with chunk max size 4MiB (to always fit a network container), min size 64 KiB (somewhat arbitrary) and avg size 1 MiB (semi-arbitrary).
It’s very simple to use.

Just like self encryption, it chunks up the files, but into 1+ chunks instead of 3+.
This is optimised for the cases where data changes, i.e. it’s not just static files.
The hashes of the chunks are kept in an index, and every time the file changes, the changed file + old hashes are passed in to the chunker, and you get back new chunks for where changes happened, and the rest will reuse the old chunks (identified by the old hashes you gave it).
That saves a lot compared to self encryption.
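A rough sketch of that reuse-by-hash flow, not Ryyn’s actual code: cdc_chunks stands in for a real FastCDC implementation and SHA-256 stands in for whatever chunk hash is actually used.

import hashlib
from typing import Callable, List, Optional, Set, Tuple

# cdc_chunks is a placeholder for a real content-defined chunker (e.g. FastCDC
# configured with the min/avg/max sizes mentioned above).
Chunker = Callable[[bytes], List[bytes]]

def rechunk(new_version: bytes, known_hashes: Set[str], chunker: Chunker) -> List[Tuple[str, Optional[bytes]]]:
    """Return (hash, payload) for each chunk of the new file version.
    payload is None when the hash is already in the index, meaning the old
    chunk is reused and only genuinely changed chunks need uploading."""
    result: List[Tuple[str, Optional[bytes]]] = []
    for chunk in chunker(new_version):
        digest = hashlib.sha256(chunk).hexdigest()   # assumed hash; placeholder
        result.append((digest, None if digest in known_hashes else chunk))
    return result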

Additionally (and optionally), bidiff is run on those changed chunks, since they have a granularity (by the chunk size config) and the actual changes may be much smaller than the chunk. Bidiffing extracts the actual changes from within the chunk, and outputs that as the changed chunk (simplified).

More info on the FastCDC + bidiff hybrid

Bidiff is very exact, but works best with files up to ~5 MiB, not very well with large files, as it gives high mem and cpu usage. FastCDC works well with large files, low mem and cpu, but is not so exact.
The combination is fast, low resource, and exact. The chunks produced by FastCDC (up to 4MiB) are a perfect input size for bidiff, and they are guaranteed to contain changes, so there is minimal waste processing data that didn’t change.

A first general localisation of changed parts in the file, removing bloat, and then a surgical extraction of the actual changed bytes.

An implementation of this hybrid can be seen here: GitHub - oetyng/deltakit: Streaming hybrid FastCDC + binary-diff

For every version of a file there is a manifest that contains everything needed to restore the file. It lists all the chunks with their offset and len in the file, the address of the network container, and the offset in it.

Example json:

{
  "file_id": "01234...",
  "file_ver": 1,
  "file_sha": "abcdef0...",
  "chunks": [
    {
      "off": 0,
      "len": 12582912,
      "sha": "0fabeab...",
      "addr": {
        "loc": "aaaaaa...",
        "off": 0
      }
    },
    {
      "off": 12582912,
      "len": 12582912,
      "sha": "ad0fbed...",
      "addr": {
        "loc": "bbbbbb...",
        "off": 16
      }
    },
    {
      "off": 25165824,
      "len": 6291456,
      "sha": "efadbede...",
      "addr": {
        "loc": "cccccc...",
        "off": 32
      }
    }
  ]
}

This manifest can be sent to anyone, and they can access the file, given that it is public or the user has the necessary key/s.
The listed chunks can be reused ones from any other file previously stored (the selection base depends on what chunk index you keep).
The chunks can be located in the same or different network containers.
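For illustration, a client consuming such a manifest could be little more than the sketch below; fetch_range is a placeholder for whatever call performs the ranged read on a container, and decryption plus any bidiff patching are left out.

import json
from typing import Callable

def restore_file(manifest_json: str, fetch_range: Callable[[str, int, int], bytes]) -> bytes:
    """Rebuild one file version from its manifest.
    fetch_range(container_addr, offset, length) is assumed to exist and to
    return the raw chunk bytes; decryption/diff handling is omitted here."""
    manifest = json.loads(manifest_json)
    out = bytearray(sum(chunk["len"] for chunk in manifest["chunks"]))
    for chunk in manifest["chunks"]:
        blob = fetch_range(chunk["addr"]["loc"], chunk["addr"]["off"], chunk["len"])
        out[chunk["off"]:chunk["off"] + chunk["len"]] = blob  # place chunk at its offset in the file
    return bytes(out)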

A client for this could be very slim, and have just as simple an API as any file upload would have.

(Btw, Ryyn encrypts individual chunks using a chunk-specific key, so when you share files, individual chunks can be read from containers without revealing the rest of the contents in a container. Meaning you can share a single version of that file, and you share only that.)

4 Likes

@zettawatt, did you see this thread and my past post? Proposal: Tarchive - #23 by Traktion

Tar files are great for this stuff. Also tarindexer (python app) basically creates the manifest of offsets (and is easy to reimplement).

The chunk streamer library already accepts offsets/limits, so it’s easy to get the right bits from the right chunks.

A simple solution is just wrapping the files.tar and the files.tar.idx in a public archive, then checking for their existence on request, parsing the index for a filename, and retrieving the correct bytes.

Then you can have ant://xor/filename_in_tar.txt and it will resolve from tar and return data.

To be performant, caching the chunks would be ideal though, as many files will be within the same chunks. You don’t want to have to keep downloading them.
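That cache could start out as something as simple as memoising the chunk fetch and slicing only the chunks a requested range actually touches. A sketch, assuming a hypothetical get_chunk() fetcher and equal-sized chunks (a simplification):

from functools import lru_cache
from typing import List

CHUNK_SIZE = 4 * 1024 * 1024   # assumed fixed chunk size, for simplicity

def get_chunk(address: str) -> bytes:
    """Placeholder for the real per-address chunk fetch from the network."""
    raise NotImplementedError

@lru_cache(maxsize=64)         # keep the most recently used chunks in memory
def get_chunk_cached(address: str) -> bytes:
    return get_chunk(address)

def read_range(chunk_addresses: List[str], start: int, end: int) -> bytes:
    """Serve bytes start..end (inclusive) by touching only overlapping chunks."""
    out = bytearray()
    for i in range(start // CHUNK_SIZE, end // CHUNK_SIZE + 1):
        chunk = get_chunk_cached(chunk_addresses[i])
        lo = max(start - i * CHUNK_SIZE, 0)
        hi = min(end - i * CHUNK_SIZE + 1, len(chunk))
        out += chunk[lo:hi]
    return bytes(out)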

Seems your post mostly aligns with this too?

5 Likes

(Edit: Ah, it was for OP :slight_smile: )

1 Like

Nice to see this as a POC! :smiley:

Yes, the range query stuff is already in AntTP. Better still, the functionality is provided by the underlying chunk_streamer library (which interprets the range values and fetches the right data/chunks). I created this to allow non-AntTP apps to use this, if needed.

It’s good to see the caching is working well, even with the partial ranges too. The browser must be smart enough to combine range with etag, which is good to know.

What I’d like to add to AntTP (and plan to tomorrow, if possible) is to integrate this stuff internally. So, instead of specifying a range on an XOR, you provide a filename (just like public archives with AntTP), which is then used to derive the offset/limit, based on a tar index. Then you would just have http://localhost:18888/ef25ac24affee21b12ae8ccfc7ab3fb8bc8981afa0af1eec2386cb5191f84d5d/celebrate.png etc. (or, with the proxy enabled, http://ef25ac24affee21b12ae8ccfc7ab3fb8bc8981afa0af1eec2386cb5191f84d5d/celebrate.png)

Moreover, adding an LRU chunk cache server side would allow multiple files to be pulled from the same downloaded chunk. So, for your example POC, it would be about 4x quicker, as the first chunk would include all 4 files (if under 4mb at least).

5 Likes

:smiley: and just to mention it xD who would have guessed that people would start trying to stick as much as possible into those chubby 4MB chunks xD … so the tarchive is not self encrypted anymore, if I understand right? @Traktion :smiley: a nice property of using a self-encrypted tar archive / binary blob and just accessing it via range requests is that self encryption is still in place

…nice to see that tapes are part of the fun, with tar actually being the tape archiver

:smiley: … ignorant me xD

4 Likes

Self-encryption still present, with tar files just like regular files, as far as the network is concerned.

When I said 4mb, I meant 12mb, I suppose, i.e. they will still be split at least 3 ways, etc.

It’s worth bearing in mind that 1 change within that 12mb is just as cheap as 1 change in a 1kb file too. It’s still essentially 3 chunks being uploaded, just that you get more data/value in them.

So, if you currently have a public archive with, say, 20 files in it and you change 1 file, you still have to upload the 3 chunks for the new file, then more (1?) for the public archive. However, providing a whole new 12mb tar file, with any number of files changed, is still the same 3 chunk change.

Ofc, you may lose something from deduplication, but that’s only really true if different files share the same chunks. I suspect that is only usually the case if someone else (or you) attempts to upload the exact same file another time. So, I’m less concerned about that… more the cost of the chunks for little files and the performance gains both up/down from grouping them as a single file.

EDIT: note that you can also append to tar files (even with replacement files, with same paths)… if they grow beyond the chunk size limits, previous chunks will still benefit from deduplication if appended + uploaded again.

(apologies for any chunk number inaccuracies - I’m just skimming over this! ha!)

4 Likes

Thanks everyone for the feedback and ideas! This is great stuff. I’ll need to spend some more time absorbing this after I get off work this evening.

To be clear, as some already noted, the motivation behind this is purely saving on upload fees. The $8/GB value I was seeing was for files averaging a few MB from the Internet Archive; this particular dataset has an average file size in the hundreds of KB, so the fees will be even higher, likely $10/GB+. If I can get the price down to sub $1/GB then I can afford to upload the archive myself. Outside of this case, we’ll see similar data structures in things like git repos, though grabbing individual files may not be that important there.

I like the idea of indexing by file name like @Traktion mentioned vs index:

ant://ef25ac24affee21b12ae8ccfc7ab3fb8bc8981afa0af1eec2386cb5191f84d5d/celebrate.png

Adding this as an application level enhancement makes a lot of sense.

I need to read into the tar approach vs FastCDC. The Project Gutenberg archive will require periodic updates, so it would be good to have an easy way to upload new 12MB blobs as changes come in across the random assortment of files.

I also agree with start/end indexing since it is easier on the extraction tool. For raw data, maybe we could encode that into the address as well, for example:

ant://ef25ac24affee21b12ae8ccfc7ab3fb8bc8981afa0af1eec2386cb5191f84d5d/0/4096

to grab the first 4096 bytes of the address (byte position 0 to 4095 inclusive). With the MaidSafe folks working on streaming downloads, this seems like something that could be put into the core libraries. Then the application level just needs to keep track of start/end location with the core libs taking care of streaming the proper fragment to the user. Any MaidSafe devs on here?
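If that address form were adopted, an application-side helper could translate it into an ordinary range request against a local gateway. A sketch, where the gateway URL and the exclusive end (matching ‘/0/4096 = first 4096 bytes’) are assumptions:

import re
import requests

def fetch_fragment(ant_uri: str, gateway: str = "http://localhost:18888") -> bytes:
    """Translate ant://<xor>/<start>/<end> into a ranged GET via a local gateway.
    The /start/end suffix, its exclusive end, and the gateway are assumptions."""
    match = re.fullmatch(r"ant://([0-9a-f]{64})/(\d+)/(\d+)", ant_uri)
    if not match:
        raise ValueError("expected ant://<64-hex-xor>/<start>/<end>")
    xor, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    response = requests.get(f"{gateway}/{xor}", headers={"Range": f"bytes={start}-{end - 1}"})
    response.raise_for_status()
    return response.content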

4 Likes