Glad to see this topic; loads of interesting archives have already been mentioned.
A couple of personal interests of mine I’d like to see archived:
http://data.discogs.com/ This is by far the best music database available, and I’ve always worried about it disappearing if Discogs decided to disband, went bust, or whatever. The dataset is obviously being modified all the time, but even just having a recent snapshot of it would make me sleep easier.
Welcome - Doomworld /idgames database frontend. All of the community content for id Software’s Doom (believe it or not, I still play it all the time :)) is collected here. Again, this one is also subject to change all the time, but a recent snapshot would still be hugely valuable.
For another dataset (I mentioned Sci-Hub earlier), I also have a complete copy of the USPTO patent registry. If we could get all papers and all patents on SAFE, that would be pretty amazing.
Big plus to Discogs, and to early computing resources. I would kill for a set of ISO specifications as well (those are much harder to pirate); the technical specs for old engineering standards shouldn’t be behind such a huge paywall.
There’s an open-source, non-profit alternative to Stack Exchange. Who knew?!
Not quite a public dataset, or maybe it is, but certainly in tune with the Safe Network Fundamentals, I present:
Codidact
The Open Source Q&A Platform.
What is Codidact?
Ever had a question? Ever had an answer? Ever had knowledge to share? We believe that coming together to inquire and learn, to share knowledge and to teach each other, makes the world a better place. We also believe that Q&A communities should be free from the politics and shenanigans of private, profit-focused companies.
Community First
We’re building an open-source platform for community-driven Q&A sites. We welcome beginners and experts alike, and aim to empower communities of learners and teachers. We’re funded by seeking donations, not by paywalls, and work with individual communities that want to join us.
The Codidact Foundation
Our non-profit organization, The Codidact Foundation, is made up of people directly from our communities, located in countries all over the world. We’re incorporated in the United Kingdom as a Community Interest Company (CIC), with plans to become a Charitable Incorporated Organisation (CIO) in the future. The organization exists to provide a central focal point for processing costs and donations, including opening and maintaining a bank account. You can read more in our initial announcement, or view our Leadership page.
Found this project today. Some nice data to upload to the test nets.
The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
Data Location
The Common Crawl dataset lives on Amazon S3 as part of the Amazon Web Services’ Open Data Sponsorships program. You can download the files entirely free using HTTP(S) or S3.
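For anyone who wants to poke at it, each monthly crawl publishes gzipped path listings (warc.paths.gz, wat.paths.gz, wet.paths.gz) that can be fetched over plain HTTPS. Here’s a minimal Python sketch, assuming the requests library is installed; the crawl label CC-MAIN-2015-32 is an assumption (the July 2015 crawl mentioned later in the thread), so swap in whichever crawl you’re after:

```python
import gzip

import requests

# Common Crawl serves its data over HTTPS; the crawl label below is an
# assumption -- replace it with the monthly crawl you actually want.
BASE = "https://data.commoncrawl.org"
CRAWL = "CC-MAIN-2015-32"  # assumed label for the July 2015 crawl

# 1. Fetch the gzipped list of WAT file paths for this crawl.
resp = requests.get(f"{BASE}/crawl-data/{CRAWL}/wat.paths.gz", timeout=60)
resp.raise_for_status()
paths = gzip.decompress(resp.content).decode("utf-8").splitlines()
print(f"{len(paths)} WAT files in this crawl")

# 2. Download the first WAT file (each is a few hundred MB, gzipped).
first = paths[0]
with requests.get(f"{BASE}/{first}", stream=True, timeout=60) as dl:
    dl.raise_for_status()
    with open("sample.wat.gz", "wb") as out:
        for chunk in dl.iter_content(chunk_size=1 << 20):
            out.write(chunk)
print("saved sample.wat.gz")
```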
Well… I am playing with this and am having trouble with the .warc files. I’ve never seen anything like them. An initial search turns up abandoned “playback” software and discussions about how the many .warc format variants are driving people mad.
Does anyone have any experience using these?
Ideally I’d like to transfer this data into database form: year, domain, URL, content, outbound links.
Or am I missing the point of .warc files completely?
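In case it’s useful: a WARC file is just a concatenation of records (request, response, metadata, and so on), each with its own headers and payload, and a library saves you from parsing that by hand. Here’s a minimal sketch using Python’s warcio package to pull out the fields listed above; the filename is a placeholder, and note that outbound links aren’t in the raw response records, so you’d either parse the HTML yourself or read them pre-extracted from the WAT metadata files:

```python
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

# Filename is a placeholder; this works on plain .warc or gzipped .warc.gz files.
with open("sample.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # Only 'response' records carry the actual fetched page.
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        date = record.rec_headers.get_header("WARC-Date")  # e.g. 2015-07-03T12:00:00Z
        year = date[:4] if date else None
        domain = urlparse(url).netloc if url else None
        body = record.content_stream().read()  # raw HTTP payload (HTML, etc.)
        print(year, domain, url, len(body))
        # Outbound links: parse `body` with an HTML parser, or take them
        # pre-extracted from the corresponding WAT metadata file.
```

From there, inserting each tuple into SQLite or whatever database you prefer is straightforward.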
The latest round of crawling amounted to 390 TiB, i.e. 390 × 1024⁴ bytes / 10⁹ = 428,809.53483264 GB:
The data was crawled May 27 – June 11 and contains 3.1 billion web pages or 390 TiB of uncompressed content. Page captures are from 44 million hosts or 35 million registered domains and include 1.0 billion new URLs, not visited in any of our prior crawls.
For instance, Andrew Gallant (aka ‘BurntSushi’) had a go at indexing it in 2015 with an implementation written in Rust, the ‘fst’ crate:
It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly.
The Common Crawl is huge. Like, petabytes huge. I’m ambitious, but not quite that ambitious. Thankfully, the good folks over at the Common Crawl publish their data set as a monthly digest. I went for the July 2015 crawl, which is over 145 TB.
That’s still a bit too big. Downloading all of that data and processing it would take a long time. Fortunately, the Common Crawl folks come through again: they make an index of all “WAT” files available. “WAT” files contain meta data about each page crawled, and don’t include the actual raw document. Among that meta data is a URL, which is exactly what I’m after.
Despite narrowing the scope, downloading this much data over a cable modem with a 2 MB/s connection won’t be fun. So I spun up a c4.8xlarge EC2 instance and started downloading all URLs from the July 2015 crawl archive with [a] shell script.
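Since WAT files came up a couple of times above: each WAT entry is a ‘metadata’ record whose payload is a JSON document describing a single capture, including (for HTML pages) a pre-extracted list of outbound links. A rough sketch with the same warcio library; the filename is a placeholder and the nested key names follow the Common Crawl WAT layout as I understand it, so treat them as assumptions and check against a real file:

```python
import json

from warcio.archiveiterator import ArchiveIterator

# Filename is a placeholder for a WAT file downloaded from Common Crawl.
with open("sample.wat.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        meta = json.loads(record.content_stream().read())

        # Outbound links, if this capture was an HTML page; the key path is
        # the usual WAT layout but is an assumption -- verify on real data.
        links = (
            meta.get("Envelope", {})
            .get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {})
            .get("Links", [])
        )
        outbound = [link.get("url") for link in links if "url" in link]
        print(url, len(outbound))
```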