Glad to see this topic; loads of interesting archives have already been mentioned.
A couple of personal interests of mine I’d like to see archived:
http://data.discogs.com/ This is by far the best music database available, and I’ve always worried about it disappearing if Discogs decided to disband, went bust, or whatever. The dataset is obviously being modified all the time, but even just having a recent snapshot of it would make me sleep easier.
Welcome - Doomworld /idgames database frontend. All of the community content for id Software’s Doom (believe it or not, I still play it all the time :)) is collected here. Again, this one is also subject to change all the time, but a recent snapshot would still be hugely valuable.
For another dataset (I mentioned Sci-Hub earlier), I also have a complete copy of the USPTO patent registry. If we could get all papers and all patents on SAFE, that would be pretty amazing.
Big plus to Discogs, and to early computing resources. I would kill for a set of ISO specifications as well (those are much harder to pirate); the technical specs for old engineering standards shouldn’t be behind such a huge paywall.
There’s an open-source, non-profit alternative to Stack Exchange. Who knew?!
Not quite a public dataset, or maybe it is, but certainly in tune with the Safe Network Fundamentals, I present:
Codidact
The Open Source Q&A Platform.
What is Codidact?
Ever had a question? Ever had an answer? Ever had knowledge to share? We believe that coming together to inquire and learn, to share knowledge and to teach each other, makes the world a better place. We also believe that Q&A communities should be free from the politics and shenanigans of private, profit-focused companies.
Community First
We’re building an open-source platform for community-driven Q&A sites. We welcome beginners and experts alike, and aim to empower communities of learners and teachers. We’re funded by seeking donations, not by paywalls, and work with individual communities that want to join us.
The Codidact Foundation
Our non-profit organization, The Codidact Foundation, is made up of people directly from our communities, located in countries all over the world. We’re incorporated in the United Kingdom as a Community Interest Company (CIC), with plans to become a Charitable Incorporated Organisation (CIO) in the future. The organization exists to provide a central focal point for processing costs and donations, including opening and maintaining a bank account. You can read more in our initial announcement, or view our Leadership page.
Found this project today. Some nice data to upload to the test nets.
The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
Data Location
The Common Crawl dataset lives on Amazon S3 as part of the Amazon Web Services’ Open Data Sponsorships program. You can download the files entirely free using HTTP(S) or S3.
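For anyone who wants to poke at it, each monthly crawl publishes gzipped path listings (warc.paths.gz, wat.paths.gz, wet.paths.gz) that can be fetched over plain HTTPS. Here’s a minimal Python sketch, assuming the requests library is installed; the crawl label CC-MAIN-2015-32 is an assumption (the July 2015 crawl mentioned later in the thread), so swap in whichever crawl you’re after:

```python
import gzip

import requests

# Common Crawl serves its data over HTTPS; the crawl label below is an
# assumption -- replace it with the monthly crawl you actually want.
BASE = "https://data.commoncrawl.org"
CRAWL = "CC-MAIN-2015-32"  # assumed label for the July 2015 crawl

# 1. Fetch the gzipped list of WAT file paths for this crawl.
resp = requests.get(f"{BASE}/crawl-data/{CRAWL}/wat.paths.gz", timeout=60)
resp.raise_for_status()
paths = gzip.decompress(resp.content).decode("utf-8").splitlines()
print(f"{len(paths)} WAT files in this crawl")

# 2. Download the first WAT file (each is a few hundred MB, gzipped).
first = paths[0]
with requests.get(f"{BASE}/{first}", stream=True, timeout=60) as dl:
    dl.raise_for_status()
    with open("sample.wat.gz", "wb") as out:
        for chunk in dl.iter_content(chunk_size=1 << 20):
            out.write(chunk)
print("saved sample.wat.gz")
```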
Well… I am playing with this and am having trouble with the .warc files. I’ve never seen anything like them. An initial search turns up abandoned “playback” software and discussions about how the many .warc format variants are driving people mad.
Does anyone have any experience using these?
Ideally I’d like to transfer this data into database form: year, domain, URL, content, outbound links.
Or am I missing the point of .warc files completely?
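In case it’s useful: a WARC file is just a concatenation of records (request, response, metadata, and so on), each with its own headers and payload, and a library saves you from parsing that by hand. Here’s a minimal sketch using Python’s warcio package to pull out the fields listed above; the filename is a placeholder, and note that outbound links aren’t in the raw response records, so you’d either parse the HTML yourself or read them pre-extracted from the WAT metadata files:

```python
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

# Filename is a placeholder; this works on plain .warc or gzipped .warc.gz files.
with open("sample.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # Only 'response' records carry the actual fetched page.
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        date = record.rec_headers.get_header("WARC-Date")  # e.g. 2015-07-03T12:00:00Z
        year = date[:4] if date else None
        domain = urlparse(url).netloc if url else None
        body = record.content_stream().read()  # raw HTTP payload (HTML, etc.)
        print(year, domain, url, len(body))
        # Outbound links: parse `body` with an HTML parser, or take them
        # pre-extracted from the corresponding WAT metadata file.
```

From there, inserting each tuple into SQLite or whatever database you prefer is straightforward.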
The latest round of crawling amounted to 390 TiB, i.e. 390 × 1024⁴ bytes / 10⁹ = 428,809.53483264 GB:
The data was crawled May 27 – June 11 and contains 3.1 billion web pages or 390 TiB of uncompressed content. Page captures are from 44 million hosts or 35 million registered domains and include 1.0 billion new URLs, not visited in any of our prior crawls.
For instance, Andrew Gallant (aka ‘BurntSushi’) had a go at indexing it in 2015 with an implementation written in Rust, the ‘fst’ crate:
It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly.
The Common Crawl is huge. Like, petabytes huge. I’m ambitious, but not quite that ambitious. Thankfully, the good folks over at the Common Crawl publish their data set as a monthly digest. I went for the July 2015 crawl, which is over 145 TB.
That’s still a bit too big. Downloading all of that data and processing it would take a long time. Fortunately, the Common Crawl folks come through again: they make an index of all “WAT” files available. “WAT” files contain meta data about each page crawled, and don’t include the actual raw document. Among that meta data is a URL, which is exactly what I’m after.
Despite narrowing the scope, downloading this much data over a cable modem with a 2 MB/s connection won’t be fun. So I spun up a c4.8xlarge EC2 instance and started downloading all URLs from the July 2015 crawl archive with [a] shell script.
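Since WAT files came up a couple of times above: each WAT entry is a ‘metadata’ record whose payload is a JSON document describing a single capture, including (for HTML pages) a pre-extracted list of outbound links. A rough sketch with the same warcio library; the filename is a placeholder and the nested key names follow the Common Crawl WAT layout as I understand it, so treat them as assumptions and check against a real file:

```python
import json

from warcio.archiveiterator import ArchiveIterator

# Filename is a placeholder for a WAT file downloaded from Common Crawl.
with open("sample.wat.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        meta = json.loads(record.content_stream().read())

        # Outbound links, if this capture was an HTML page; the key path is
        # the usual WAT layout but is an assumption -- verify on real data.
        links = (
            meta.get("Envelope", {})
            .get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {})
            .get("Links", [])
        )
        outbound = [link.get("url") for link in links if "url" in link]
        print(url, len(outbound))
```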