Project Gutenberg Autonomi mirror?

I’m in the process of mirroring Project Gutenberg to my local machine. Looks like it’s going to take a LONG time to pull down the roughly 1 TB of public domain books they host. The pile is massive: something like 75,000 public domain books in multiple formats, with metadata. I plan to write a script to upload all the files and plug the metadata into Colony for searching and management, just like I did for the ia_downloader program. The metadata here is pretty good compared to the Internet Archive’s, though, so it should mostly be a matter of shuffling data around.
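
To give a sense of the metadata side, here is a rough sketch of the kind of loop I have in mind. It assumes the standard Project Gutenberg catalog tarball (rdf-files.tar.bz2) and its usual RDF element names; the Autonomi upload and Colony pod calls are just placeholders since that part isn’t written yet:

```python
# Rough sketch of the metadata pass (not the final script). Assumes the
# standard Project Gutenberg catalog tarball (rdf-files.tar.bz2) with one
# pg<id>.rdf record per book; the upload/Colony steps are placeholders.
import tarfile
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
DCTERMS = "{http://purl.org/dc/terms/}"
PGTERMS = "{http://www.gutenberg.org/2009/pgterms/}"

def iter_catalog(catalog_path="rdf-files.tar.bz2"):
    """Yield (book_id, title, creators) for each RDF record in the catalog."""
    with tarfile.open(catalog_path, "r:bz2") as tar:
        for member in tar:
            if not member.name.endswith(".rdf"):
                continue
            root = ET.parse(tar.extractfile(member)).getroot()
            ebook = root.find(f"{PGTERMS}ebook")
            if ebook is None:
                continue
            book_id = ebook.get(f"{RDF}about")              # e.g. "ebooks/1342"
            title = ebook.findtext(f"{DCTERMS}title", default="")
            creators = [n.text for n in ebook.iter(f"{PGTERMS}name")]
            yield book_id, title, creators

if __name__ == "__main__":
    for book_id, title, creators in iter_catalog():
        # Placeholder: upload the book's files to Autonomi, then write the
        # returned addresses plus this metadata into a Colony pod.
        print(book_id, title, creators)
```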

For small files, my upload costs have been pretty consistent at around $8/GB, so I’m not going to be able to afford the gas on this one by myself. Plus, I’m not sure how important all of the formats are. If I just upload a single EPUB for each book, my guess is the archive is probably only a couple hundred GB.
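
For a back-of-the-envelope feel, here are the rough numbers; all of these figures are guesses from this thread, not measured quotes from the network:

```python
# Back-of-the-envelope cost estimates using the rough numbers above;
# every figure here is an assumption, not a measured network quote.
FULL_MIRROR_GB = 1000        # ~1 TB for every format of all ~75,000 books
EPUB_ONLY_GB = 250           # guess: a single EPUB per book
SMALL_FILE_RATE = 8.0        # observed ~$8/GB when uploading many small files
PACKED_RATE = 1.0            # hoped-for ~$1/GB if data is packed efficiently

print(f"Full mirror, unpacked: ${FULL_MIRROR_GB * SMALL_FILE_RATE:,.0f}")
print(f"EPUB only, unpacked:   ${EPUB_ONLY_GB * SMALL_FILE_RATE:,.0f}")
print(f"EPUB only, packed:     ${EPUB_ONLY_GB * PACKED_RATE:,.0f}")
# Full mirror, unpacked: $8,000
# EPUB only, unpacked:   $2,000
# EPUB only, packed:     $250
```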

It sounds like $1/GB or less is possible if the data is packed efficiently. Since this is one contiguous block of data, I think it makes sense to do some kind of data packing, with multiple files included in each physical ‘file’ on Autonomi. Then we just need a schema to pull the individual file out of the chunk(s) where it resides. I’ve been thinking of adding something like this to Colony, kind of an L2 for data packing in cases like this where a pod is a massive data set of similar small files.
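
To be concrete, here is a minimal sketch of the sort of packing schema I mean. The format is just something made up for illustration (concatenated blobs plus a JSON offset index), not an actual Colony or Rynn spec:

```python
# Minimal illustration of a pack-and-index scheme: concatenate many small
# files into one archive blob and keep a JSON index of (offset, length) per
# file, so extracting one file is a single seek-and-read.
import json
import os

def pack(paths, archive_path, index_path):
    """Concatenate `paths` into one archive file and write an offset index."""
    index = {}
    offset = 0
    with open(archive_path, "wb") as archive:
        for path in paths:
            with open(path, "rb") as f:
                data = f.read()
            archive.write(data)
            index[os.path.basename(path)] = {"offset": offset, "length": len(data)}
            offset += len(data)
    with open(index_path, "w") as f:
        json.dump(index, f, indent=2)

def unpack_one(name, archive_path, index_path):
    """Pull a single file back out of the archive using the index."""
    with open(index_path) as f:
        entry = json.load(f)[name]
    with open(archive_path, "rb") as archive:
        archive.seek(entry["offset"])
        return archive.read(entry["length"])

# e.g. pack(["pg1342.epub", "pg84.epub"], "shelf-0001.pack", "shelf-0001.json")
#      unpack_one("pg1342.epub", "shelf-0001.pack", "shelf-0001.json")
```

A real spec would also need to decide where the index lives (presumably in the pod metadata) and how big each packed archive should be relative to Autonomi’s chunking, but that’s the general shape of it.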

So I have a few questions for the group:

  • Should we upload the entire Project Gutenberg mirror or just a single file per book? On the one hand it will be a lot more expensive than just getting the content onto the network, but at the same time there is some marketing potential in saying we’re maintaining a mirror of Project Gutenberg on Autonomi. That’s pretty cool by itself. It also means we would have to maintain it, however, which would mean periodic uploads to patch existing content and all the complications that brings.
  • Are there any others interested in pitching in and doing a bunch of uploads like this to populate the network, or in funding the cause? Once I get it scripted, it should ‘just work’. From there it becomes a time, bandwidth, and money problem for whoever is running the script.
  • Is it worth developing a data packing L2 mechanism for public data like this, or is it better to live with the inefficiency and list each file as-is to reduce complications at this stage of network development? The data unpacking would be built into Colony, and I would release the spec so others could use it or unpack the data in their own applications, but I wasn’t sure if this was something folks would be interested in. I heard @oetyng has built something into Rynn that does this kind of data packing already; maybe we could leverage that work here?

As I’ve searched around the internet for publicly available ‘stuff’ that I could legally upload, this seemed like the most accessible pile out there with accompanying accurate metadata.

13 Likes

Well done for thinking of this and making a start.

Painful though it would be in terms of upload time and cost, I’m sure individual files would be better: it keeps each file downloadable as quickly and simply as possible, avoids relying on anything to unpack them, makes the result more like a traditional mirror (as you say), and probably has other advantages not thought of yet. The only advantage of packing them is the cost and time saving.

5 Likes

Here is a bit of insight, if not inspiration, as to what the really BIG problem is with the digitization of everything nowadays, from Neil Oliver just in the last day or so. Very timely.

The real trick is linking what one uploads permanently back to the original paper document/official source.

Using ISBN links/#s might be one approach for authored works.

As for government public documents in government archives, access to each is a ‘mileage will vary’ challenge, as they all ‘grant’ access to original documents differently.

A lot of these regimes make you file a request (e.g. FOIA in the USA) and pay (and also reveal your identity in their track-and-trace BS) before granting access to what are ‘public documents’ available to any taxpayer of the regime.

So the stored map to the chunks may at some point in the future need to be expanded to store links to the original documents?

Food for thought.

2 Likes

Neil Oliver is a self-promoting nobody, desperately trying to get his name known worldwide after (rightly) having his reputation trashed in Scotland.

Ignore the tosser. He has ZERO actual tech background. A bandwagon-jumping nobody.

Just like a stopped clock, he may be correct occasionally but it means eff-all.

1 Like

Yeah, he has a big audience of sheep as well…

1 Like