Public Datasets on Safe Network

Glad to see this topic; I can see loads of interesting archives have already been mentioned.

A couple of personal interests of mine I’d like to see archived:

  • http://data.discogs.com/ This is by far the best music database available, and I’ve always worried about it disappearing if Discogs decided to disband, went bust, or whatever. This dataset is obviously being modified all the time, but even just having a recent snapshot of it would make me sleep easier.
  • The Doomworld /idgames database frontend. All of the community content for id Software’s Doom (believe it or not, I still play it all the time :)) is collected here. Again, this one is also subject to change all the time, but a recent snapshot would be of a lot of value.
9 Likes

[Carl Malamud](https://en.wikipedia.org/wiki/Carl_Malamud) is well on the way to creating a key part of this universal access to knowledge: a 38-terabyte index of 107 million scientific articles. The first release of what he calls the General Index can be downloaded by anyone with a fast enough Internet connection, and has no restrictions on its use.

7 Likes

Hey, cool share, John. Interesting guy, this Glyn. His views on copyright are something I’m going to dig into.

2 Likes

Yeah, he’s a good guy and always worth a read - he’s been writing on digital rights and privacy issues for ages.

2 Likes

There should be an app/system for nodes to donate a percentage of their earnings to this project!

1 Like

Wait until hackers’ data sets of people’s KYC bullshit get uploaded onto the Safe Network when corporates don’t meet demands… just saying :sweat_smile:

1 Like

Do you have an estimate of how large these sets are?

1 Like

The Doomworld archive is, I think, only 50-100 GB, and the Discogs one around 1-2 TB.

3 Likes

Maybe good to choose datasets which are being lost or currently at risk. Any ideas?

Here’s an example:

10 Likes

For another data set (I mentioned Scihub earlier), I also have a complete copy of the USPTO patent registry. If we could get all papers and all patents on SAFE, then that would be pretty amazing.

Big plus for Discogs and early computing resources. I would kill for a set of ISO specifications as well (those are much harder to pirate); the technical specs for old engineering standards shouldn’t be behind such a huge paywall.

6 Likes

Might be nice, with permission, to associate this kind of archive with Aaron Swartz in some way. He died for wanting scientific knowledge to be free.

10 Likes

I stumbled on this today…

A preview of the content is available here:

https://rachel.worldpossible.org/preview

10 Likes

Filecoin is also storing public datasets via Filecoin Discover.

There’s a list of the datasets they’re currently storing on the network.

9 Likes

Archive Team (associated with archive.org) lists some datasets:

https://wiki.archiveteam.org/index.php/Recommended_Reading#Online_Archives_of_Interest

4 Likes

Classic

2 Likes

There’s an open-source, non-profit alternative to Stack Exchange. Who knew?! :man_shrugging:

Not quite a public dataset, or maybe it is, but certainly in tune with the Safe Network Fundamentals, I present:

Codidact

The Open Source Q&A Platform.

What is Codidact?

Ever had a question? Ever had an answer? Ever had knowledge to share? We believe that coming together to inquire and learn, to share knowledge and to teach each other, makes the world a better place. We also believe that Q&A communities should be free from the politics and shenanigans of private, profit-focused companies.

Community First

We’re building an open-source platform for community-driven Q&A sites. We welcome beginners and experts alike, and aim to empower communities of learners and teachers. We’re funded by seeking donations, not by paywalls, and work with individual communities that want to join us.

The Codidact Foundation

Our non-profit organization, The Codidact Foundation, is made up of people directly from our communities, located in countries all over the world. We’re incorporated in the United Kingdom as a Community Interest Company (CIC), with plans to become a Charitable Incorporated Organisation (CIO) in the future. The organization exists to provide a central focal point for processing costs and donations, including opening and maintaining a bank account. You can read more in our initial announcement, or view our Leadership page.

Foundation: https://codidact.org

Q&A sites: https://codidact.com

Mastodon: @codidact@fosstodon.org

4 Likes

Found this project today. Some nice data to upload to the test nets.

The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

Data Location

The Common Crawl dataset lives on Amazon S3 as part of the Amazon Web Services’ Open Data Sponsorships program. You can download the files entirely free using HTTP(S) or S3.
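
To make that concrete, here’s a minimal sketch in Python of pulling one file down over plain HTTPS. The base URL, crawl ID, and path layout are assumptions on my part, so check the current crawl announcement for the real values.

```python
# Rough sketch: list a crawl's WARC files and download one over HTTPS.
import gzip
import requests

BASE = "https://data.commoncrawl.org"  # assumed download host
CRAWL = "CC-MAIN-2023-23"              # assumed ID for the May/June 2023 crawl

# Each crawl publishes a gzipped list of its WARC file paths.
resp = requests.get(f"{BASE}/crawl-data/{CRAWL}/warc.paths.gz", timeout=60)
resp.raise_for_status()
paths = gzip.decompress(resp.content).decode().splitlines()
print(f"{len(paths)} WARC files in this crawl")

# Stream a single WARC file (roughly 1 GB compressed) to disk.
with requests.get(f"{BASE}/{paths[0]}", stream=True, timeout=60) as r:
    r.raise_for_status()
    with open("sample.warc.gz", "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```

The same files should be reachable over S3 as well (the bucket is s3://commoncrawl, if I remember right) for anyone who’d rather use the AWS CLI or boto3.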

8 Likes

Well… I am playing with this and am having trouble with the .warc files. Never seen anything like it. An initial search turns up abandoned “playback” software and discussions about how the many .warc format variants are driving people mad.

Does anyone have any experience using these?

Ideally I’d like to transfer this data into a database: year, domain, URL, content, outbound links.

Or am I missing the point of .warc files completely?
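
To be concrete, something like this is roughly what I’m imagining, using the warcio Python library (pip install warcio). It’s an untested sketch, and the file name, table layout and column names are just placeholders.

```python
# Untested sketch: pull (year, domain, URL, body) rows out of a WARC file.
# warcio reads gzipped WARCs directly, whatever tool produced them.
import sqlite3
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

db = sqlite3.connect("crawl.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS pages (year TEXT, domain TEXT, url TEXT, body BLOB)"
)

with open("sample.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # 'response' records carry the fetched page; 'request', 'metadata'
        # and 'warcinfo' records are bookkeeping and get skipped here.
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        date = record.rec_headers.get_header("WARC-Date")  # e.g. "2023-06-01T..."
        body = record.content_stream().read()
        # Outbound links would need an HTML parse of `body` on top of this.
        db.execute(
            "INSERT INTO pages VALUES (?, ?, ?, ?)",
            (date[:4] if date else None, urlparse(url).netloc, url, body),
        )

db.commit()
```

The WAT files mentioned further down carry just the per-page metadata (including extracted links), so they might be a lighter route to the URL and outbound-link columns.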

The latest round of crawling amounted to 390 TiB, or (to 8 decimal places) 428809.53483264 GB:

May/June 2023 crawl archive now available

The data was crawled May 27 – June 11 and contains 3.1 billion web pages or 390 TiB of uncompressed content. Page captures are from 44 million hosts or 35 million registered domains and include 1.0 billion new URLs, not visited in any of our prior crawls.
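
For reference, that GB figure is just the binary TiB to decimal GB conversion:

```python
# 390 TiB expressed in decimal gigabytes
print(390 * 1024**4 / 10**9)  # 428809.53483264
```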

For instance, Andrew Gallant (aka ‘BurntSushi’) had a go at indexing the Common Crawl URLs back in 2015 with a Rust implementation called the ‘fst’ crate:

It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly.

The code is at https://github.com/BurntSushi/fst, and from the blog post:

Index 1,600,000,000 Keys with Automata and Rust

The Common Crawl is huge. Like, petabytes huge. I’m ambitious, but not quite that ambitious. Thankfully, the good folks over at the Common Crawl publish their data set as a monthly digest. I went for the July 2015 crawl, which is over 145 TB.

That’s still a bit too big. Downloading all of that data and processing it would take a long time. Fortunately, the Common Crawl folks come through again: they make an index of all “WAT” files available. “WAT” files contain meta data about each page crawled, and don’t include the actual raw document. Among that meta data is a URL, which is exactly what I’m after.

Despite narrowing the scope, downloading this much data over a cable modem with a 2 MB/s connection won’t be fun. So I spun up a c4.8xlarge EC2 instance and started downloading all URLs from the July 2015 crawl archive with [a] shell script.

6 Likes

Apparently there’s a Flickr Foundation whose mission is to keep #Flickr and all those photos around for a hundred years. :thinking:

1 Like