I’m in the process of mirroring Project Gutenberg to my local machine. It looks like it’s going to take a LONG time to pull down the roughly 1 TB of public domain book data they host. The pile is massive, something like 75,000 public domain books in multiple formats, with metadata. I plan to write a script to upload all the files and plug the metadata into Colony for searching and management, just like I did for the ia_downloader program. The metadata here is quite good compared to the Internet Archive, though, so it should mostly be a matter of shuffling data around.
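For anyone curious, the shape of the script is pretty simple. Here’s a rough sketch in Python: it assumes the usual mirror layout (cache/epub/&lt;id&gt;/pg&lt;id&gt;.epub alongside pg&lt;id&gt;.rdf), and the actual Autonomi upload and Colony calls are just a placeholder since I haven’t written those yet:

```python
# Rough sketch of the upload/index script. Assumes the generated files live
# under cache/epub/<id>/ in the local mirror; the upload/Colony step is a stub.
from pathlib import Path

MIRROR_ROOT = Path("/data/gutenberg")  # wherever the local rsync mirror lands

def iter_books(root: Path):
    """Yield (book_id, epub_path, rdf_path) for every book that has an EPUB."""
    for book_dir in sorted((root / "cache" / "epub").iterdir()):
        if not book_dir.is_dir():
            continue
        book_id = book_dir.name
        epub = book_dir / f"pg{book_id}.epub"
        rdf = book_dir / f"pg{book_id}.rdf"
        if epub.exists() and rdf.exists():
            yield book_id, epub, rdf

def upload_and_index(book_id: str, epub: Path, rdf: Path) -> None:
    # Placeholder: upload the EPUB to Autonomi and feed the RDF metadata
    # into Colony, along the same lines as the ia_downloader workflow.
    print(f"would upload {epub.name} ({epub.stat().st_size} bytes) for book {book_id}")

if __name__ == "__main__":
    for book_id, epub, rdf in iter_books(MIRROR_ROOT):
        upload_and_index(book_id, epub, rdf)
```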
For small files, my upload costs have been pretty consistent at around $8/GB, so I’m not going to be able to afford the gas on this one by myself. Plus, I’m not sure how important all of the formats are. If I just upload a single EPUB per book, my guess is the archive is probably only a couple hundred GB.
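For a rough sense of scale on the single-EPUB option (the ~3 MB average EPUB size here is just my guess):

```python
# Back-of-envelope estimate for one EPUB per book at the current small-file rate.
BOOKS = 75_000
AVG_EPUB_MB = 3  # assumed average EPUB size
RATE_PER_GB = 8  # roughly what I've been seeing for small files

total_gb = BOOKS * AVG_EPUB_MB / 1024
print(f"~{total_gb:,.0f} GB total")               # ~220 GB
print(f"~${total_gb * RATE_PER_GB:,.0f} in gas")  # ~$1,758 at $8/GB
```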
It sounds like $1/GB or less is possible if the data is packed efficiently. Since this is one contiguous block of data, I think it makes sense to do some kind of data packing, with multiple files included in each physical ‘file’ on Autonomi. Then we just need a schema to pull an individual file back out of the chunk(s) where it resides. I’ve been thinking of adding something like this to Colony, kind of an L2 for data packing in cases like this where a pod is a massive data set of similar small files.
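To make that a bit more concrete, here’s a minimal sketch of the pack side, assuming a simple (path, offset, length) index; the JSON format and field names are just illustrative, not a spec:

```python
# Packing sketch: concatenate files into one blob and record where each one
# starts and how long it is, so any single book can be pulled back out later.
# The JSON index layout here is illustrative only.
import json
from pathlib import Path

def pack(files: list[Path], blob_path: Path, index_path: Path) -> None:
    index = []
    with open(blob_path, "wb") as blob:
        for f in files:
            data = f.read_bytes()
            index.append({"path": f.name, "offset": blob.tell(), "length": len(data)})
            blob.write(data)
    index_path.write_text(json.dumps(index, indent=2))
```

The idea would be to upload the blob as one big file on Autonomi and keep the index alongside the pod metadata, so the per-file overhead is amortized across thousands of books.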
So I have a few questions for the group:
- Should we upload the entire Project Gutenberg mirror or just a single file per book? On the one hand, the full mirror will be a lot more expensive than just getting the content onto the network, but on the other hand, there is some marketing potential in saying we’re maintaining a mirror of Project Gutenberg on Autonomi. That’s pretty cool by itself. It also means we would have to maintain it, however, which would mean periodic uploads to patch existing content and all the complications that brings.
- Are there any others interested in pitching in, either by doing a bunch of uploads like this to populate the network or by funding the cause? Once I get it scripted, it should ‘just work’. Then it’s just a time, bandwidth, and money problem for whoever is running the script.
- Is it worth developing a data packing L2 mechanism for public data like this, or is it better to accept the inefficiency and list each file as-is to reduce complications at this stage of network development? The data unpacking would be built into Colony, and I would release the spec so others could use it or unpack the data in their own applications, but I wasn’t sure if this was something folks would be interested in (see the pack sketch above and the unpack sketch just below this list). I heard @oetyng has already built something like this data packing into Rynn; maybe we could leverage that work here?
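To go with the pack sketch above, here’s the unpack side, again just a sketch: with the index, pulling one book back out is a seek plus a bounded read, which on the network would mean fetching only the chunk(s) that cover that byte range:

```python
# Unpacking sketch: look up a book in the index and read just its byte range
# from the packed blob. On Autonomi this would become a ranged read of the
# packed file rather than a local seek, but the schema is the same.
import json
from pathlib import Path

def extract(blob_path: Path, index_path: Path, name: str) -> bytes:
    index = json.loads(index_path.read_text())
    entry = next(e for e in index if e["path"] == name)
    with open(blob_path, "rb") as blob:
        blob.seek(entry["offset"])
        return blob.read(entry["length"])
```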
Having searched around the internet for publicly available ‘stuff’ that I could legally upload, this seemed like the most accessible pile out there, with accurate accompanying metadata.