Disclaimer: I started studying data analysis about 7 weeks ago in an attempt to change careers. I’ve never been much of a tech guy before, so I am very much a beginner and this may be way off the mark…
The concept of a ‘single source of truth’ (SSOT) seems to be very trendy in data analysis, AI and business intelligence, but I can’t find anything about SSOT for open data, and I’m wondering whether anyone thinks this might be a potential application for the Safe Network.
There is a huge amount of data out there in the public domain, and one of the main obstacles to mining it and drawing fresh insights is that it’s in so many different formats, which often aren’t compatible with each other.
This seems to be exactly the problem that occurs within organizations dealing with large amounts of data, and it’s what leads them to build a single source of truth. The idea is simply to make disparate data sources compatible through centralization.
Three whiskies into this evening (yes, it’s Scotch, which is appropriate) and I’m wondering whether this concept of a single source of truth can be applied beyond a single organization, and whether something like the Safe Network might be what’s needed to build it.
At first sight, my guess is that this would take the form of a standardization effort of sorts, which has its dangers, but if it’s collaborative it could potentially draw on multiple sources to dynamically create a single source of truth based on the combination of inputs.
What could this look like?
Set specific formats for a given metric. E.g. GDP per capita: is it in local currency, USD, or something else entirely, such as valued in gold?
Connect multiple open data sources with procedures to harmonize the data according to a shared standard.
Take some central measure of the submitted sources (e.g. the median) as the truth; there’s a rough sketch of this below.
Store the SSOT along with a list of all compatible open data sources on Safe.
Simply storing a repository of already compatible open data sources with information about shared attributes may also be useful.
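
To make the middle steps a bit more concrete, here is a rough Python sketch of what harmonizing submissions to a shared standard and taking a central measure as the truth could look like. Everything in it is made up for illustration (the metric spec, the exchange rates, the record layout); it’s not a real Safe Network API, just the general shape of the idea.

```python
import statistics

# Hypothetical shared standard for one metric: GDP per capita, expressed in USD.
METRIC_SPEC = {"metric": "gdp_per_capita", "unit": "USD", "frequency": "annual"}

# Illustrative exchange rates for converting local-currency figures to the
# shared standard (in practice these would come from yet another open source).
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}


def harmonize(record, usd_per_unit=USD_PER_UNIT):
    """Convert one source's submission into the shared standard (USD)."""
    rate = usd_per_unit[record["currency"]]
    return {
        "country": record["country"],
        "year": record["year"],
        "value": record["value"] * rate,
        "unit": METRIC_SPEC["unit"],
    }


def single_source_of_truth(records):
    """Take a central measure (here the median) of all harmonized submissions
    for the same country and year, and note how many sources contributed."""
    values = [r["value"] for r in records]
    return {
        "country": records[0]["country"],
        "year": records[0]["year"],
        "value": statistics.median(values),
        "unit": METRIC_SPEC["unit"],
        "n_sources": len(values),
    }


# Three hypothetical open data sources reporting the same figure in different currencies.
submissions = [
    {"country": "DE", "year": 2023, "value": 48500, "currency": "EUR"},
    {"country": "DE", "year": 2023, "value": 52000, "currency": "USD"},
    {"country": "DE", "year": 2023, "value": 41000, "currency": "GBP"},
]

harmonized = [harmonize(s) for s in submissions]
print(single_source_of_truth(harmonized))
# This resulting record, plus the list of contributing sources, is what
# would get stored on Safe as the SSOT entry for that metric.
```

The median is just one possible choice here; a weighted average or a most-recent-wins rule could equally serve as the central measure, and deciding which to use would itself be part of the shared standard.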
The main benefit as I see it would be to make it easier for either AI or human analysts to map connections across multiple open data sources and perform a broader analysis than would otherwise be possible.
My only question is this: am I drunk already or is this the seed of a good idea?