Informal RFC: Internet-efficient ZIP file format

Problem: 100,000 small files equate to 100,000 PUT requests.
A person could zip them all up and have only one file to PUT, meaning fewer PUTs into SAFE storage. But if that creates a ZIP file over 1 GB in size, extracting a single file from that archive likely involves pulling the entire 1 GB file down from the internet just to find the “file index” saved at the end of the ZIP file.
https://en.wikipedia.org/wiki/Zip_(file_format)

I propose making a copy of the content index as a separate file. It would be important to duplicate the ZIP file’s end-of-central-directory record, followed by the central directory records (each of which carries its file’s CRC-32), so the copy can be used for validation as well as lookup. A file extension of *.netzip might be a good fit for this index file. The pair of files would usually share the same filename and be kept together under one logical file path, which works for traditional servers as well as for the pseudo-drive that a MAID disk presents to a computer.
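
As a minimal sketch of what producing that sidecar could look like (assuming an archive with no trailing comment, so the end-of-central-directory record is exactly the last 22 bytes; the .netzip name, the example path, and the function names are placeholders from this proposal, not an existing tool):

```rust
use std::fs;
use std::io;

fn le_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Sketch: copy a ZIP's central directory plus its end-of-central-directory
/// (EOCD) record into a sidecar "<name>.netzip" file. Assumes the archive
/// has no trailing comment, so the EOCD is exactly the last 22 bytes.
fn write_netzip_index(zip_path: &str) -> io::Result<()> {
    let zip = fs::read(zip_path)?;
    if zip.len() < 22 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "too small to be a ZIP"));
    }
    let eocd = &zip[zip.len() - 22..];

    // EOCD signature 0x06054b50, stored little-endian as "PK\x05\x06".
    if &eocd[0..4] != b"PK\x05\x06" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "EOCD not at end (archive comment?)"));
    }

    // Central directory size (bytes 12..16) and start offset (bytes 16..20) of the EOCD.
    let cd_size = le_u32(&eocd[12..16]) as usize;
    let cd_offset = le_u32(&eocd[16..20]) as usize;

    // Sidecar = central directory records followed by the EOCD record,
    // byte-for-byte as they appear in the archive.
    let mut sidecar = Vec::with_capacity(cd_size + 22);
    sidecar.extend_from_slice(&zip[cd_offset..cd_offset + cd_size]);
    sidecar.extend_from_slice(eocd);

    fs::write(format!("{}.netzip", zip_path), sidecar)
}

fn main() -> io::Result<()> {
    write_netzip_index("backups-2016-03-01.zip")
}
```

One nice property of a verbatim byte copy is that any existing ZIP library that can parse a central directory can parse this index too.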

Benefits:

  • A small index-only file could be downloaded quickly, and it would carry references to the byte locations, within the archive, of the files a user actually wants.
  • If the SAFE Network is implemented in such a way that in-file read pointers can be managed, or in a fashion similar to resumable download requests that ask for 300 KB starting at byte 85,000 of a 1 GB file, we could reduce unneeded data transmission over the Network (see the sketch after this list).
  • This two-file archive can also reduce the number of PUTs that a user would need to pay for when storing archived data (which may ease wide adoption of the SAFE Network as a storage platform for everyday users). Each 1 MB chunk is a PUT.
    The ZIP may be fewer total MB (fewer PUTs), and the slack space of files smaller than 1 MB is reduced (similar to a file’s size-on-disk being affected by sector size).
  • ZIP is a good candidate format to start with because changed files are traditionally appended to the end of the archive without purging the old contents; the index is simply rewritten to point to the new start location, much like multi-session writable CDs.
  • If downloaded, the full ZIP file (without its external index file) would still function as a traditional ZIP file.
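
To make the resume-style read in the second bullet concrete, here is a minimal sketch of a byte-range read against a file exposed through a mounted drive. The path is a placeholder; whether SAFE ultimately exposes this via a FUSE-style mount or an API call, the shape of the request is the same: an offset plus a length.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Read `len` bytes starting at `offset` from a file on a mounted drive.
/// Only the requested range needs to travel over the wire if the backing
/// store supports partial reads.
fn read_range(path: &str, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    f.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> io::Result<()> {
    // The example from the bullet above: 300 KB starting at byte 85,000.
    let bytes = read_range("/mnt/safe/archive.zip", 85_000, 300 * 1024)?;
    println!("fetched {} bytes", bytes.len());
    Ok(())
}
```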

Caveats:

  • This type of archive file would be best for large quantities of small files that are read but rarely changed.
  • If the zip contents were changed, the external index would need to be updated accordingly.

Please note, this should be considered a draft. I’m hoping to work out the details along with all of you in the community. Thanks!


Aren’t small files batched into 1 MB-sized PUT requests?

I knew that 1 MB was the biggest chunk size that larger files would be PUT as. Does a 10 MB file get converted into 10 separate 1 MB PUTs, or stored as 1 PUT that is broken up into chunks?

I hadn’t considered that 10x 100KB files could be grouped into a single 1MB PUT. I was expecting that each file would be applied as its own PUT that would be split into chunks if the contents were greater than 1 MB.

Maybe I stepped away from the details here for too long. (I know you’re famous for creative ways of saying RTFM.)

Yes, this is how it works. The network will accept 100 KB files without chunking them, but above that (I think that’s the limit) everything gets chunked. So a 10 MB video would be chunked and self-encrypted into ten 1 MB chunks. The datamap will contain 10 hashes, which allows the owner to get the chunks back and decrypt them. So for video, playback could start once only the first 2 parts are decrypted while the rest is still downloading. There was a topic with more information about this. Can’t remember where exactly.
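
Purely as an illustration of that datamap idea (this is not the real self_encryption API; the type names, the DefaultHasher placeholder, and the chunk-size constant are all stand-ins), splitting a 10 MB buffer into 1 MB chunks and recording one hash per chunk looks roughly like this:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const CHUNK_SIZE: usize = 1024 * 1024; // the 1 MB chunk size discussed above

/// Illustrative stand-in for a datamap entry: which chunk, how long it is,
/// and a hash that lets the owner request the chunk back later.
/// (The real self_encryption crate has its own types and uses cryptographic
/// hashes; DefaultHasher here is only a placeholder.)
struct ChunkRef {
    index: usize,
    len: usize,
    hash: u64,
}

/// Split content into 1 MB chunks and record one hash per chunk, in order.
fn build_datamap(content: &[u8]) -> Vec<ChunkRef> {
    content
        .chunks(CHUNK_SIZE)
        .enumerate()
        .map(|(index, chunk)| {
            let mut h = DefaultHasher::new();
            chunk.hash(&mut h);
            ChunkRef { index, len: chunk.len(), hash: h.finish() }
        })
        .collect()
}

fn main() {
    let video = vec![0u8; 10 * CHUNK_SIZE]; // a 10 MB file, as in the example above
    let map = build_datamap(&video);
    println!("{} chunks in the datamap", map.len()); // 10
    for c in &map {
        println!("chunk {} ({} bytes) -> hash {:016x}", c.index, c.len, c.hash);
    }
}
```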


10 PUTs. Each chunk is sent by the client to the network to be stored. The network doesn’t know whether the chunks belong to one file or not; the datamap that is built is what ties them together.

As to the ZIP idea, I’d say that an APP is the best option to zip the files, create your index, and store these on your behalf. They remain your files etc. No need for the network code to do this.


Absolutely. I had no intentions of suggesting this concept be built into the network.

I considered posting in the topic “Other Projects”, but posted in “Dev” because the use case I was exploring was specifically within the SAFE Network.


Feedback says: the benefit of reduced PUTs is off my list, because each 1 MB chunk is a PUT. That said, zipping may still reduce the total file size, and reduce slack space, since files smaller than 1 MB get aggregated.

I didn’t read the fine manual either (partially because things always change, so until I hear from others that the docs are reliable I won’t try to read them). I just know that writes are asynchronous, so if you write 36 files of 2 KB each and close the app, I wouldn’t expect more than 1 PUT.
I don’t know how long the flushing interval is, but I assume there will be a setting for it, so one can avoid extending the interval (and with it the risk of data loss).
Unless you already know these details, it may make sense to read the source code first and (if you decide you need to) submit an RFC later.

What about seeking in said video file? Is file access (going to be) implemented in such a way that an app could request to start at chunk 300 of a file? I’m sure that depends on the details of data-map implementation.

I think I remember seeing something in a Dev update that support for PUTs over 1 MB was still waiting to be implemented. (Maybe it was only in the Dev-Bundle roadmap.) I haven’t run test vaults myself, so I don’t know if they’ve got it working with files over 1 MB yet.


You would be able to retrieve any chunk: a video player would seek to some part of the file, and the interface retrieving the file from the SAFE Network would fetch the corresponding chunk(s).

So the same concept should be applicable to the current standard ZIP format. But knowing which chunk to pull requires the index (currently a variable-length segment at the end of the ZIP file).

Does anyone think that an external copy of the index would be of practical use in an internet environment? It would be almost like a thumbnail file.

It’s not a “concept” but the standard way seeking works (fseek()).
If you want to read the range between 298 and 401 MB, appropriate chunks will be requested. If the file was stored sequentially, the 300th chunk will likely be requested.
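
For the arithmetic behind that (assuming 1 MiB chunks and a sequentially stored file, which is how the example above is framed), mapping a byte range onto chunk indices is just a pair of integer divisions:

```rust
const CHUNK_SIZE: u64 = 1024 * 1024; // assuming 1 MiB chunks

/// Which 0-based chunk indices cover the byte range [start, end)?
fn chunks_for_range(start: u64, end: u64) -> (u64, u64) {
    let first = start / CHUNK_SIZE;
    let last = (end + CHUNK_SIZE - 1) / CHUNK_SIZE; // round up (exclusive bound)
    (first, last)
}

fn main() {
    // The 298-401 MB range from the example above.
    let (first, last) = chunks_for_range(298 * CHUNK_SIZE, 401 * CHUNK_SIZE);
    println!("fetch chunks {} through {}", first, last - 1); // 298 through 400
}
```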

This is not related to what you’re trying to accomplish.
As someone said earlier, if anyone wants an index, they can build it for themselves.
SAFE doesn’t and shouldn’t limit choice of implementations for this (if they’re at all necessary - as I said above, I doubt they are).

Yeah, I’m just considering efficiency when working with an internet-based disk. In our current models, if we want a preview of the ZIP’s contents, the whole file is hosted on a server’s disk. The server can look over the whole file’s content (at physically-connected-drive speeds) and generate a preview (as is done with Google Drive). But with the file distributed, there’s a bottleneck on drive-access speed, namely the internet pipe you’re using. Either the whole large file has to be “pulled down”, or a small index could be placed beside the file for preview and selective reads by an application.

Your type of index is not necessarily the same as someone else’s index. Someone may build a big PNG file with all the pics in it. Some other vendor may build a searchable text file index.

The index file can be stored in the first chunk (if it’s smaller than 1MB) and also can be cached to save 1 GET upon app startup. But the files themselves can also be cached, especially if you use the same devices to connect to the SAFE network. As I mentioned earlier, writes are cached as well (and flushed in 1MB chunks), so where is the inefficiency that you’re trying to eliminate?

But the index used in ZIP files is specifically what I’m looking at. Copying that from inside the ZIP to outside the ZIP would be the same for anyone using this proposed format.

Note: this isn’t a SAFE-exclusive idea. A secondary ZIP-index file could work just as well for ZIPs pushed up to an FTP server or a WebDAV internet drive.

Use case:
Off-site backup. A small company decides to use the SAFE Network for their off-site storage. Database backups are pulled daily and zipped. The ZIP file gets uploaded to the SAFE drive. … Three months later, an audit request creates a need to pull the history of 3 databases (of 50 that were all backed up daily). Rather than downloading the entire ZIP for every day in question, a program could look at the .netzip index and only download the 3 BAK files from within each ZIP.
The inefficiency here would be downloading 30 GB of backups when only 3 GB of their contents are really needed. This proposed index format lets a NetZip program seek within the ZIP file without having to read the whole file just to find the variable-length index at its end. It would only need to download the .netzip index file, which gives it the references for where to seek in the ZIP file.
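
As an illustration of that seek step (assuming the .netzip sidecar proposed earlier, i.e. a verbatim copy of the central directory records; the path and function names are placeholders), a reader can walk those records and learn each member’s name, compressed size, and local-header offset, which is exactly the information needed to issue targeted range reads against the full archive:

```rust
use std::fs;
use std::io;

fn le_u16(b: &[u8]) -> u16 { u16::from_le_bytes([b[0], b[1]]) }
fn le_u32(b: &[u8]) -> u32 { u32::from_le_bytes([b[0], b[1], b[2], b[3]]) }

/// Walk the central directory records stored in a .netzip sidecar and print,
/// for each member, what a NetZip-style program needs in order to issue a
/// targeted range read against the full archive: name, compressed size, and
/// the offset of the member's local header.
fn list_netzip(index_path: &str) -> io::Result<()> {
    let cd = fs::read(index_path)?;
    let mut pos = 0;
    // Central directory file header signature: 0x02014b50 ("PK\x01\x02").
    while pos + 46 <= cd.len() && &cd[pos..pos + 4] == b"PK\x01\x02" {
        let rec = &cd[pos..];
        let compressed = le_u32(&rec[20..24]);
        let name_len = le_u16(&rec[28..30]) as usize;
        let extra_len = le_u16(&rec[30..32]) as usize;
        let comment_len = le_u16(&rec[32..34]) as usize;
        let local_offset = le_u32(&rec[42..46]);
        if 46 + name_len > rec.len() {
            break; // truncated record; stop rather than panic in this sketch
        }
        let name = String::from_utf8_lossy(&rec[46..46 + name_len]);
        println!("{}: {} compressed bytes, local header at byte {}", name, compressed, local_offset);
        pos += 46 + name_len + extra_len + comment_len;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Only the small sidecar gets downloaded; the 30 GB archive stays untouched
    // until specific members are range-read out of it.
    list_netzip("backups-2016-03-01.zip.netzip")
}
```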

Do note: I’m proposing a general-use file format. I want to discuss it here with SAFE Network because this community has extremely constructive technical minds (and I don’t have many places to find lots of those people actively involved). When the SAFE Network is running, we will run into cases like this – where a legacy file format from 1989 is adapted into an environment using live internet-disks. I’m just trying to think ahead to using it.

Limit is exactly 3072 bytes (3 KB). See self_encryption/src/lib.rs at line 454.


If you use tar, that problem is pretty much solved, I think.


Oh, good call. Procedural changes are usually easier than technical ones.

If the files are managed in the standard way via the NFS API, then one PUT should be enough to store a set of files when the following conditions are met:

  • An unversioned directory is used (to avoid an additional PUT to hold the version)
  • Each file is under 3 KB, so that the generated datamaps hold the file content
  • Total size of datamaps + metadata is under 100 KB (short file names and no user metadata help reduce the size of the metadata).

So your figures (36 files of 2 KB each) should be OK for only one PUT (see the sketch after this list). Additional PUTs are generated when:

  • A file is longer than 3 KB (3 additional PUTs for each file longer than 3 KB, more for each file longer than 3 MB)
  • Total size of datamaps + metadata exceeds 100 KB (3 additional PUTs globally).
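
Taking those rules at face value (a sketch based only on the description in this post, not on the actual NFS/self_encryption source; the constants and the 3-chunk minimum are as stated above), a rough PUT estimator would look like:

```rust
const INLINE_LIMIT: u64 = 3 * 1024;     // files up to 3 KB live inside the datamap
const CHUNK_SIZE: u64 = 1024 * 1024;    // 1 MB chunks
const METADATA_LIMIT: u64 = 100 * 1024; // datamaps + metadata up to 100 KB fit in the directory PUT

/// Rough PUT estimate for storing `file_sizes` (in bytes) under one
/// unversioned directory, following the rules described in the post above.
fn estimate_puts(file_sizes: &[u64], metadata_bytes: u64) -> u64 {
    let mut puts = 1; // the directory itself (unversioned, so no extra version PUT)
    for &size in file_sizes {
        if size > INLINE_LIMIT {
            // at least 3 chunks per file, more once the file grows past 3 MB
            let chunks = (size + CHUNK_SIZE - 1) / CHUNK_SIZE;
            puts += chunks.max(3);
        }
    }
    if metadata_bytes > METADATA_LIMIT {
        puts += 3; // the directory content itself gets chunked
    }
    puts
}

fn main() {
    // The earlier example: 36 files of 2 KB each, small metadata -> 1 PUT.
    let small = vec![2 * 1024u64; 36];
    println!("36 x 2 KB -> {} PUT(s)", estimate_puts(&small, 10 * 1024));

    // One 10 MB file: 10 chunks plus the directory PUT -> 11 by these rules.
    println!("1 x 10 MB -> {} PUT(s)", estimate_puts(&[10 * 1024 * 1024], 1024));
}
```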

It’s an interesting idea, but as I think you’ve realized, the best thing to do is advise against storing very large zip files and let the network take care of compression. Then you both avoid these issues and we all benefit from deduplication.

There might be a case for combining small files if they get stored inefficiently, in which case this would be something to build into Drive (the virtual file system), so the end user doesn’t need to worry.

I’m not sure if this is necessary, but I did read somewhere in an RFC that it is an issue, and that app devs were to be encouraged to batch up data rather than do lots of small PUTs - so I think it is worth looking into. Could be a nice project for someone ;-)

Do what any program that processes zip files does:

Read the last block of the zip file and progress backwards until you have the whole of the index. In the days of storing zip files across 360 KB diskettes, the program asked for the last diskette to be inserted when reading the zip file.

Exactly the same would be done by any program processing the zip file on SAFE or other split backup media (7zip, winzip, etc).

So behind the scenes the file system gets the last chunk of the file (maybe 2 if needed) and retrieves the index.
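
In code terms, “get the last chunk and walk backwards” amounts to something like the sketch below (the path is a placeholder; over a mounted SAFE drive the tail read would translate into fetching the final chunk or two):

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Find the end-of-central-directory (EOCD) record by reading only the tail
/// of the archive and scanning backwards for its signature ("PK\x05\x06").
/// Returns (central directory offset, central directory size).
fn locate_central_directory(path: &str) -> io::Result<(u64, u64)> {
    let mut f = File::open(path)?;
    let file_len = f.metadata()?.len();

    // Read at most the last 1 MB: the EOCD is 22 bytes plus an optional
    // comment of up to 64 KB, so one trailing chunk is normally enough.
    let tail_len = file_len.min(1024 * 1024);
    f.seek(SeekFrom::Start(file_len - tail_len))?;
    let mut tail = vec![0u8; tail_len as usize];
    f.read_exact(&mut tail)?;

    if tail.len() < 22 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "too small to be a ZIP"));
    }

    // Walk backwards until the EOCD signature lines up.
    for start in (0..=tail.len() - 22).rev() {
        if &tail[start..start + 4] == b"PK\x05\x06" {
            let cd_size = u32::from_le_bytes([tail[start + 12], tail[start + 13], tail[start + 14], tail[start + 15]]) as u64;
            let cd_offset = u32::from_le_bytes([tail[start + 16], tail[start + 17], tail[start + 18], tail[start + 19]]) as u64;
            return Ok((cd_offset, cd_size));
        }
    }
    Err(io::Error::new(io::ErrorKind::InvalidData, "EOCD record not found in the tail"))
}

fn main() -> io::Result<()> {
    let (offset, size) = locate_central_directory("/mnt/safe/backups.zip")?;
    println!("central directory: {} bytes starting at byte {}", size, offset);
    Ok(())
}
```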

In reality, if you use your favourite zip program and mount SAFE as a drive, then there is NO DIFFERENCE in what currently happens when you access a zip file on a disk.