Well, they seem to be willing to try it, given the filecoin foundation is subsidising it:
http://blog.archive.org/2021/04/01/filecoin-foundation-grants-50000-fil-to-the-internet-archive/
Interesting. They’ve offered this sort of thing at AWS etc. for years. The problem is, it is centralising. I mean, how many places can you send the drives to? How can the locations of these places remain anonymous? And how could the hosts plausibly deny knowledge of the data they are storing or hosting?
Imo, this actually highlights the inherent weaknesses of filecoin. It is another distributed market for storage, but it isn’t really moving the game forward in how the data is replicated, distributed or accessed.
I’m sure there is a market for a decentralised version of what AWS does, but I wonder how they will compete with the economies of scale, robustness, flexibility and resilience that centralised solutions already deliver. Is it worth it?
Oh yeah, I knew there was some collaboration going on. What I mean is, it’s potentially not quite the complete fit yet. Stepping stone, maybe, though.
I recall the convo that David had with Brewster though, when he told him he was hoping to put IA out of business.
That should be the ultimate aim though right?
Yes, a stepping stone perhaps. I suppose it helps them towards their goal. Free storage is free storage, after all. Ofc, it would be an ongoing cost, rather than a one-off cost.
Love the Brewster comment! So true!
To be fair, it’s always been David’s stated aim to put himself out of business too, so it’s evens.
I can’t remember that in the bnktothefuture pitch!
Seriously though, an automated network that retains history would be awesome. I’m sure there will always be some fettling and improvements to be made somewhere around the system though!
If we set the right limits to node size (or better, some algorithm that sets the limits according to network age and other parameters), “archiving The Archive” could be perfect for stabilising network growth. Do it in small parts or with very variable speed: stop it when there is enough upload of other data, and speed it up when there is a large number of nodes waiting to join.
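Something like this rough sketch is what I mean; every metric name and threshold here is made up, since the real network would expose its own:

```python
# Rough sketch of an archive-upload throttle. All names and thresholds
# are hypothetical - the real network would expose its own metrics.

def archive_upload_rate(network_age_days: int,
                        other_upload_gb_per_hour: float,
                        nodes_waiting_to_join: int,
                        max_rate_gb_per_hour: float = 100.0) -> float:
    """Return how fast (GB/h) the 'archive The Archive' job should upload."""
    # Young network: hold back so archive traffic doesn't dominate growth.
    age_factor = min(network_age_days / 365.0, 1.0)

    # Plenty of organic uploads already? Back off proportionally.
    demand_factor = 1.0 / (1.0 + other_upload_gb_per_hour / 50.0)

    # A queue of nodes waiting to join means spare capacity: speed up.
    supply_factor = min(1.0 + nodes_waiting_to_join / 100.0, 3.0)

    return min(max_rate_gb_per_hour,
               max_rate_gb_per_hour * age_factor * demand_factor * supply_factor / 3.0)

# Example: mature network, quiet uploads, big join queue -> near full speed.
print(archive_upload_rate(730, 5.0, 500))
```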
If the wallet for putting clearnet data on safenet is public, how would it be possible to ensure that the tokens are only used for the intended purpose?
I’m not 100% sure if the safe:// address for the same content is always the same even when uploaded by different owners (I know it was supposed to be at some point in development). If not the following ideas would not work.
Maybe you would need to pay for the upload yourself, and then provide the safe:// address of the data along with the http:// address of the same data. The wallet owner first checks whether the data has already been uploaded; if not, it generates the safe:// address from the content at the http:// address and compares it to the one you provided. If they match, you know it’s the right data, and the wallet owner reimburses the uploader.
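A minimal sketch of that reimbursement check, assuming safe:// addresses really are deterministic for identical content; `derive_safe_url()` and `pay()` are hypothetical stand-ins for whatever the real client API provides:

```python
# Sketch of the pay-first-then-get-reimbursed scheme described above.
# Assumes safe:// addresses are deterministic for identical content.

import hashlib
import urllib.request

def derive_safe_url(content: bytes) -> str:
    """Stand-in: pretend the safe:// address is just a content hash."""
    return "safe://" + hashlib.sha3_256(content).hexdigest()

def verify_and_reimburse(claimed_safe_url: str, http_url: str,
                         already_paid: set, pay) -> bool:
    if claimed_safe_url in already_paid:
        return False                      # don't pay twice for the same data
    content = urllib.request.urlopen(http_url).read()
    if derive_safe_url(content) != claimed_safe_url:
        return False                      # content doesn't match the claim
    already_paid.add(claimed_safe_url)
    pay(upload_cost=len(content))         # reimburse the uploader
    return True
```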
Hmm, another idea, don’t know if safenet supports it: `safe dog` the address first to make sure it doesn’t exist yet. If for some reason the data is actually not uploaded, anyone can request an “upload certificate” from the wallet owner and then upload that data.
These ideas allow people to donate to a wallet they trust to use the tokens for the intended purpose. The owners of these centralised upload servers would keep a list of all things they have uploaded in a way that people who use the system could automatically verify that the tokens are indeed all used for their intended purpose.
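The audit could then be fully mechanical. A sketch, with made-up field names, of how anyone could re-check the published upload log against the wallet’s spending:

```python
# Sketch of the audit idea: the wallet owner publishes an append-only log of
# (http source, safe address, cost) entries, and anyone can re-check that
# every token spent maps to a verifiable upload. Field names are made up.

def audit(upload_log: list, total_tokens_spent: int,
          fetch, derive_safe_url) -> bool:
    """Return True if every spent token is accounted for by a valid upload."""
    accounted = 0
    for entry in upload_log:
        content = fetch(entry["http_url"])            # pull the original
        if derive_safe_url(content) != entry["safe_url"]:
            return False                              # log entry is bogus
        accounted += entry["cost"]
    return accounted == total_tokens_spent            # no unexplained spending
```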
Okay, so I was thinking on a site-by-site basis, like each site has its own wallet. An overarching archive wallet is a neat idea too. Even when funds are empty and archiving incurs a cost, the Internet Archive could archive data more cheaply, or at least offset the cost by farming with their equipment.
Yeah, just ideas here too, but fun ones, and they could be impactful if sorted out.
Well, we could actually nest wallets. One big public archive wallet that could be used for any site, and then within that, ring-fenced wallets that could only be spent archiving specific sites. So could be both!
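Something like this toy model; just a data-structure sketch, not a real Safe wallet API:

```python
# Toy model of the nested-wallet idea: one general archive pot, plus
# ring-fenced sub-wallets that may only fund uploads for a named site.
# Not a real Safe wallet API - purely illustrative.

class ArchiveWallet:
    def __init__(self):
        self.general = 0                       # spendable on any site
        self.ring_fenced = {}                  # site -> earmarked balance

    def donate(self, amount, site=None):
        if site is None:
            self.general += amount
        else:
            self.ring_fenced[site] = self.ring_fenced.get(site, 0) + amount

    def spend(self, amount, site):
        """Drain the site's earmarked funds first, then the general pot."""
        fenced = self.ring_fenced.get(site, 0)
        from_fenced = min(fenced, amount)
        remainder = amount - from_fenced
        if remainder > self.general:
            return False                       # not enough funds for this site
        self.ring_fenced[site] = fenced - from_fenced
        self.general -= remainder
        return True
```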
That would make donation to the cause so much easier and manageable! I dig it.
I would have thought that promoting a tightly controlled state-sponsored propaganda outfit with a history of repression, outright lying and paedophilia was somewhat at odds with the ethos of the project - said he in a gentle aside
Apart from Attenborough, what else would you want to preserve from that bastion of the establishment? The one that tries to pass off utter corruption as “cronyism”.
How is that promoting? Surely you’d want information that could be deemed lies, preserved as much as information you hold to be the truth? Perhaps only historical preservation can reveal the difference?
A good point Jim, it was an aside rather than a dig
We’ll keep the Attenborough stuff anyway - and every last bit of output from John Peel too.
Just came to suggest the IA, having stumbled over the fact that it’s more than just websites… there seems to be a lot of content there: free books, movies, software, music.
Edit: and just reflecting that for a lot of these, the idea of switching costs from perpetual storage to engaging with what is new, and uploading more volume as a result, might be very appealing. I notice Gutenberg is slow, and I wonder whether any host of data worries about their ability to keep up with the costs of storage. Safe Network offers a simple sell: do it once and do it well… and worry not.
For me, the Internet Archive is where I can listen to just about every Grateful Dead gig they ever played, and a vast array of other bands, especially Little Feat - for free, secure in the knowledge that these artists put this material up there for their fans.
https://www.nature.com/sdata/policies/repositories
Scientific Data mandates the release of datasets accompanying our Data Descriptors, but we do not ourselves host data. Instead, we ask authors to submit datasets to an appropriate public data repository. Data should be submitted to discipline-specific, community-recognized repositories where possible.
Repositories for primary data deposition listed on this page meet our requirements for data access, preservation, resource stability, and suitability for use by all researchers with the appropriate types of data.
We provide an archive of our recommended repository list, which is available for use under the CC-BY licence. Recommended repositories and standards that are indexed by FAIRsharing can also be viewed and filtered via the Scientific Data FAIRsharing collection.
10,727,990,597 source files from 162,481,510 projects. Holy moly
Great podcast interview with Max Roser from Our World In Data.
A lot of data sets are mentioned:
https://www.gapminder.org/
https://www.worldbank.org/en/publication/wdr/wdr-archive
https://www.carbonbrief.org/
https://www.nature.com/sdata/
https://data.worldbank.org/
https://stats.oecd.org/
Some interesting dialog about data licensing:
MR: It used to be the case that this [World Bank] data was licensed under very restrictive permissions, only available if you order a DVD and so on. And Hans diagnosed them with, what did he call it? ‘Database Hugging Disorder,’ DHD. And he cured them of that.
RW: Interesting. Who was licensing it through a DVD? You mean Gapminder, or the World Bank?
MR: Oh, the World Bank and other UN organizations. Back in the day, they weren’t making their data available in this way. And it’s still the case for one very important data source, the International Energy Agency. That’s a partner organization of the OECD.
They produce some of the most important data in the world. They produce the global statistics on energy and climate change. And the world needs to have access to these data sources. But if you want to have access to the full data of the IEA, you pay licenses that are costing several thousands of euros. And that also means that institutions like us, but also journalists, can’t straightforwardly rely on their data and communicate that. And so we are in a situation where the best statisticians on energy produce these figures, and then they’re locked away behind a paywall. And instead of using these figures, the world relies on the data from BP, from the gas and oil multinational. They’re producing the energy stats. And so we have largely publicly funded data at the IEA that isn’t available for the public. And we have a private oil company that is producing the data that everyone relies on.
Everyone tries to work their way around this issue. So you have researchers that can’t share each other’s work. We had several of these issues where we would have access to some data, we would analyze the data, and then we can’t make it publicly available. If you make it publicly available, even in a chart or so, you get several emails from several people that ask whether you can possibly share that information with them, and you can’t, because the licenses don’t actually allow this, so that every other researcher is doubling down on this effort, and everyone is trying to do the same analysis, and is trying to avoid these restrictions with the IEA.
An amazingly informative resource here:
https://www.archivematica.org/en/
Archivematica uses METS, PREMIS, Dublin Core, the Library of Congress BagIt specification and other recognized standards to generate trustworthy, authentic, reliable and system-independent Archival Information Packages (AIPs) for storage in your preferred repository.
Compatible with hundreds of formats
In the Format Policy Registry (FPR), Archivematica implements its default format policies based on an analysis of the significant characteristics of file formats. The FPR also offers an editable, flexible framework for format identification, package extraction, transcription and normalization for preservation and access.
Memory institutions have dedicated voluminous resources over the past couple of decades to implement various software platforms to manage digital objects. For this reason, we believe in leveraging the strength of other tools and integrating with them wherever possible.
- DSpace
- CONTENTdm
- Islandora
- LOCKSS
- AtoM
- DuraCloud
- OpenStack
- Archivists’ Toolkit
- Arkivum
- ArchivesSpace
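Since the blurb above mentions the Library of Congress BagIt specification, here’s a minimal pure-stdlib sketch of the bag layout it describes (a `data/` payload plus a checksum manifest), just to show why bags are easy to verify end to end. Real workflows would use a proper BagIt library, not this:

```python
# Not Archivematica itself - just an illustration of the BagIt layout it
# builds on: a data/ payload directory plus a sha256 manifest, so the
# whole bag can be re-verified file by file.

import hashlib
import shutil
from pathlib import Path

def make_bag(src: Path, bag: Path) -> None:
    data = bag / "data"
    data.mkdir(parents=True)
    manifest_lines = []
    for f in sorted(src.rglob("*")):
        if f.is_file():
            dest = data / f.relative_to(src)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, dest)                       # copy into the payload
            digest = hashlib.sha256(dest.read_bytes()).hexdigest()
            manifest_lines.append(f"{digest}  data/{f.relative_to(src)}")
    # Checksum manifest: one line per payload file.
    (bag / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    # Bag declaration, as required by the spec.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
```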