[Speculative] SSOT and Open Data

Disclaimer: I started studying data analysis about 7 weeks ago in an attempt to change career. I’ve never been much of a tech guy before, so I am very much a beginner, and this may be way off the mark…

The concept of a ‘single source of truth’ seems to be very trendy in data analysis, AI and business intelligence, but I can’t find anything about SSOT for open data, and I am wondering whether anyone thinks this may be a potential application for the Safe Network.

There is a huge amount of data out there in the public domain, and one of the main obstacles to mining it and drawing fresh insights is that it’s in so many different formats and often isn’t compatible.

This seems to be exactly the problem that occurs within organizations dealing with large amounts of data, and it is what leads them to seek to build a single source of truth. The idea is simply to make disparate data sources compatible through centralization.

Three whiskies into this evening (yes, it’s Scotch, which is appropriate), and I’m wondering whether this concept of a single source of truth can be applied beyond a single organization, and whether something like the Safe Network might be what’s needed to build it.

At first sight, my guess would be that this would take the form of a standardization effort of sorts, which has its dangers, but if it were collaborative it could potentially draw on multiple sources to dynamically create a single source of truth based on the combination of inputs.

What could this look like?

  • Set specific formats for a given metric, e.g. GDP per capita: is it in local currency, USD, or something else, like being valued in gold, and so on.
  • Connect multiple open data sources, with procedures to harmonize the data according to a shared standard.
  • Take some central measure of the submitted sources as the truth (a rough sketch of these two steps follows the list).
  • Store the SSOT, along with a list of all compatible open data sources, on Safe.
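
To make the harmonization and central-measure steps concrete, here is a minimal Python sketch; the source records, currency rates, and field names are all invented for illustration, not taken from any real dataset:

```python
# Minimal sketch: harmonize GDP-per-capita figures reported in
# different currencies into one shared unit (USD), then take the
# median across sources as the "consensus" value.
from statistics import median

# Hypothetical conversion rates into the shared unit (USD).
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}

def harmonize(record):
    """Convert one source's record into the shared standard."""
    return {
        "country": record["country"].strip().upper(),
        "gdp_per_capita_usd": record["value"] * USD_PER_UNIT[record["currency"]],
    }

# Three hypothetical open data sources reporting the same metric.
sources = [
    {"country": "france", "value": 44000, "currency": "USD"},
    {"country": "France ", "value": 41000, "currency": "EUR"},
    {"country": "FRANCE", "value": 35000, "currency": "GBP"},
]

values = [harmonize(r)["gdp_per_capita_usd"] for r in sources]
print("consensus GDP per capita (USD):", median(values))
```

The median is just one possible central measure; a real effort would also need rules for outliers and for sources that disagree on definitions, not just on units.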

Simply storing a repository of already compatible open data sources, with information about shared attributes, may also be useful.

The main benefit as I see it would be to make it easier for either AI or human analysts to map connections across multiple open data sources and perform a broader analysis than would otherwise be possible.

My only question is this: am I drunk already or is this the seed of a good idea?

4 Likes

I think the source of truth in itself goes beyond data formats, or even data. Humanity has a problem with the interpretation of facts in many cases, and this means people fabricate a source of truth.

However, I get your point, so here is an idea. An LLM fine-tuned on your data, i.e. data you know and trust, can ingest it from any format for you, answer queries against it, and supply the source data points as well. So for data science I would tend to look deeper at that, i.e. get to a place where there are no hallucinations, via RAG or fine-tuning (the latter is better); there you can see that this is very likely already a possibility, or at least extremely close.

A trick I use in my custom prompts is the following statement:

  • Please bulletpoint your answer and provide the sources

That proves to be very effective. It certainly beats what happens to folk who don’t do this and then find hallucinations, as though the LLM were badly broken. I think the whole hallucination issue is actually weird, and I suspect that if we want creativity we will want them; but then, further into the conversation, we will also disprove any hallucinations as we go. That is how creativity works: we dream, imagine, hallucinate and dive deeper, checking our assumptions, many of which turn out wrong. We never think “oh, my first crazy brainstorm must be correct, let’s implement that” :wink:

8 Likes

I don’t know the phrase ‘a single source of truth’, so I may not understand what is meant by it, but it doesn’t make sense to me. My worldview is one thing, but the idea that there can be a single source of truth for everyone doesn’t make sense, because we all have different histories, perspectives etc.

Science is perhaps an attempt to arrive at a single truth, but as we know, not everyone subscribes to it or agrees on what science says. There is always space for disagreement and debate.

What did occur to me is the ability for everyone to access the data we have and understand its meaning, which makes me think of the semantic web (or Linked Data). There may be other attempts to standardise the representation of meaning, knowledge etc., but that is the one I’m most familiar with, and it’s something I’d like to see Safe Network support, because then people, applications etc. can at least access data and understand the meaning of it, regardless of whether they agree on its ‘truth’ or otherwise.

7 Likes

Yeah, I don’t think it would be a good idea for something like this to try to provide an answer to an uncertain question subject to scientific debate. Only to the definition of metrics.

Edit (addition): Maybe Linked Data and the semantic web are what I’m talking about; it sounds like they might be. I’ll have to look into that. I would love to see something like that on Safe, as it seems to me that the Safe Network would be a good place for it.

1 Like

That’s a great tip, I’ll use that, thanks!

Personally I think hallucination is just in the nature of LLMs. They are only programmed to sound like they know what they are talking about, not to actually know what they are talking about :smile:.

2 Likes

They share that glory with about 8 billion of us :smiley: :smiley:

3 Likes

Just had a look at the semantic web and Linked Data; that’s sort of what I was talking about. The only difference is that it seems to be something people do prior to publishing data, whereas what I was thinking of was an app to take data that’s already out there, then clean and transform it into something like that before storing it on Safe.

2 Likes

Hahaha true dat.

1 Like

I think that perhaps what is needed is an application that can convert existing sources to a particular desired standard. So an LLM isn’t needed here, although it is perhaps useful in creating the app.

SAFE could certainly be used to store the output though.
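
One hedged sketch of such a converter, using rdflib (a real Python RDF library) to turn already-cleaned rows into Linked Data style triples; the vocabulary URI and the input row are invented for illustration:

```python
# Sketch: convert cleaned tabular open data into RDF triples that
# could then be stored on Safe. The EX vocabulary is hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/opendata/")  # hypothetical vocabulary

rows = [{"country": "FR", "gdp_per_capita_usd": 44280}]  # cleaned input

g = Graph()
for row in rows:
    obs = EX["obs/" + row["country"]]
    g.add((obs, RDF.type, EX.Observation))
    g.add((obs, EX.country, Literal(row["country"])))
    g.add((obs, EX.gdpPerCapitaUSD,
           Literal(row["gdp_per_capita_usd"], datatype=XSD.decimal)))

print(g.serialize(format="turtle"))  # the Turtle output is what gets stored
```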

2 Likes

That is a major problem with SSOT for human activity. No two people describe a car crash the same way; everyone misses most of the multitude of facts, and everyone adds a slant to the facts they do express. In fact, we need a multitude of sources of truth to be able to get a more factual picture of the truth.

But as you say, this also allows error and manipulation. Error: the right people may not be interviewed, and assumptions may be used to cover a certain set of facts, because those people simply got it wrong or lied.

This was a major factor in many relational databases, where the facts were stored in one table (different tables for different “facts”) and not in multiple tables for one “fact”. But where one fact was stored in multiple tables, it was mostly for the sake of speed. Summarised ledgers and so on often hold summarised facts rather than going back to the single source of itemised figures and calculating the summary each time.

The compromise often was to update all summary figures whenever any of the original source figures were changed, added to or subtracted from, but errors can creep in doing this.
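
As a rough illustration of that compromise, here is a small sqlite3 sketch (table and column names invented) where a trigger keeps a summary table in step with the itemised ledger; a missed update path is exactly where the errors mentioned above creep in:

```python
# Sketch: itemised ledger as the source of truth, with a summary
# table maintained by a trigger for fast reads.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ledger (account TEXT, amount REAL);
CREATE TABLE summary (account TEXT PRIMARY KEY, total REAL);

CREATE TRIGGER sync_summary AFTER INSERT ON ledger
BEGIN
    UPDATE summary SET total = total + NEW.amount
        WHERE account = NEW.account;
    INSERT INTO summary (account, total)
        SELECT NEW.account, NEW.amount
        WHERE NOT EXISTS (SELECT 1 FROM summary WHERE account = NEW.account);
END;
""")

con.execute("INSERT INTO ledger VALUES ('alice', 10)")
con.execute("INSERT INTO ledger VALUES ('alice', 5)")
print(con.execute("SELECT total FROM summary WHERE account = 'alice'").fetchone())
# -> (15.0,); deletes and updates would need triggers too, or the
#    summary silently drifts away from the itemised source.
```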

The problem, as always, is being able to determine the true single source of truth without error. This is why most work from a consensus of truth; it is also one reason for quoting sources, because most sources are themselves a collection of facts.

Maths is another consensus, where we all agree on the set of rules and assumptions that maths operates under. Trial and error also helped form those rules: e.g. enough people have taken one of an object and then taken another, and it always ends up being what the consensus holds the figure 2 to stand for.

TL;DR
You need to decide what you mean by a single source of truth, since in the real world this is rare indeed in the absolute sense.

All in my opinion and understanding; in no way meant to be an SSOT, or quoting an SSOT.

4 Likes

My interpretation is that SSOT isn’t for comparing data, it’s for ease of human consumption - hence the emphasis on common formatting.

Of course, humans do compare/weigh data, so that’s always going to be there.

2 Likes

A minor point, but one worth objecting to, I feel: the former is better. Big models are expensive to train. When you fine-tune, you teach it a small, new task, like how to tie a shoelace. The model knows about shoes, knots and hands, and more importantly, how words come next. You’re adding a tiny task to its capabilities.

RAG makes a semantic store of the data you say is important. When a user asks a question, the system first checks for matching documents in the semantic store, then passes these along with the original question to the model. The model looks for the answer in the semantic store results, with the power of knowing which words come next.
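
A toy sketch of that flow, with a bag-of-words set standing in for a real embedding model and a hypothetical `ask_llm` callable standing in for an actual LLM API:

```python
# Toy RAG sketch: retrieve the best-matching documents from a
# "semantic store", then pass them plus the question to the model.
def embed(text):
    # Stand-in "embedding": a set of lowercased words. A real system
    # would use a vector embedding model here.
    return set(text.lower().split())

def similarity(a, b):
    # Jaccard overlap between two word sets.
    return len(a & b) / max(len(a | b), 1)

documents = ["Bob is 20 years old.", "Alice leads project A."]
store = [(doc, embed(doc)) for doc in documents]  # the semantic store

def answer(question, ask_llm):
    q = embed(question)
    # 1. Retrieve the documents that best match the question.
    matches = sorted(store, key=lambda d: similarity(q, d[1]), reverse=True)[:2]
    context = "\n".join(doc for doc, _ in matches)
    # 2. Pass the matches and the original question to the model.
    return ask_llm(f"Answer using only this context:\n{context}\n\nQ: {question}")
```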

2 Likes

You are right in some respects, and I was wrong to say it was always best. I also just asked ChatGPT for more info, to add to the convo (it’s all new and very interesting):

ChatGPT

When determining what is best for accuracy between Retrieval Augmented Generation (RAG) and fine-tuning models, it’s important to understand the context and application for which you’re using the language model.

  1. Retrieval Augmented Generation (RAG): RAG combines a pre-trained language model with a retrieval system. It retrieves documents or data relevant to the input question or prompt and then generates an answer based on both the retrieved information and its own knowledge. RAG is particularly effective when accurate and detailed responses are needed, and when the answer might be contained in a specific set of documents. It’s used in situations where having the most up-to-date information or accessing a wide range of sources for cross-referencing is crucial.
  2. Fine-Tuning: Fine-tuning involves taking a pre-trained model and continuing the training process on a new, typically smaller, dataset that is specific to the task at hand. This method is effective when you have specific, niche tasks or you want the model to adapt to a particular style, tone, or domain of language. Fine-tuning can significantly improve performance on the specific task it’s trained for.

Factors to consider:

  • Task Specificity: For general tasks, fine-tuning a large pre-existing model on a smaller, task-specific dataset might yield better results. However, for tasks requiring up-to-date or expansive external information, RAG might be more suitable.
  • Data Availability: Fine-tuning requires a relevant dataset for the specific task. If such a dataset is not available or is too small, the benefits of fine-tuning might be limited.
  • Accuracy vs. Freshness: If the most accurate answer depends on the latest data or information, RAG might perform better as it can retrieve the most recent documents. For more stable domains of knowledge, fine-tuned models might provide more accurate responses based on their training.
  • Computational Resources: Fine-tuning and running RAG models can be resource-intensive. Consider the computational cost and availability of resources when choosing an approach.

Best for Accuracy?

  • There’s no one-size-fits-all answer. For tasks where the information landscape changes frequently or where the query is about very specific, recent, or varied topics, RAG might be more accurate. For tasks where there’s a well-defined scope and a specific style or domain knowledge is required, fine-tuning might provide better accuracy.

It’s also worth noting that the field of AI and NLP is continually evolving, and hybrid approaches or entirely new methods may offer improved accuracy for different tasks. Experimenting with both methods and perhaps combining them could yield the best results for your particular needs. Consulting the latest research and case studies specific to your domain can also help guide the decision.

5 Likes

Shoosht and huv a dod o shortie

1 Like

Whether to fine-tune a model or use RAG is complicated, and there are long papers written on this, but a rule of thumb is that you may want to use RAG for factual statements that could potentially be stored in a database, like “Bob is 20 years old”, and fine-tuning for procedural knowledge, like recognizing specific objects or generating text in a particular style.

In regards to single source of truth, there are two layers to this. First, there is where the data is stored. You may have an SQL database as the single source of truth for your data, and this data may be copied to other databases with other querying or analysis capabilities, like Elasticsearch for full-text search. As I understand it, this is mainly what is meant by SSOT.

The other layer is provenance, or the original source of some statement. If you have a knowledge graph that contains a statement saying that Bob helped Alice on project A, then by following the whole provenance chain you may ultimately find it comes from a document containing meeting minutes that someone later entered into the knowledge graph.

For open data, these layers blend more into each other. When you have a single source of truth within an organization, it is simply some datastore chosen to be the source of truth, so whatever is stored there is “true”. If an SQL database is the single source of truth, then an Elasticsearch index may potentially be out of date, as the SQL database is where updates go first. With open data it’s more complicated, but the semantic web community has thought a lot about this, and the way to deal with it is that you may have statements copied into knowledge graphs all around; if you have kept track of the provenance, you will know where each knowledge graph sourced its data from, and ultimately even where the knowledge graph that first stated it sourced it from, such as some document.
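
As a sketch of what keeping that provenance chain can look like, here is a small example using rdflib’s named graphs together with the PROV-O `wasDerivedFrom` property; the URIs and the statement itself are invented:

```python
# Sketch: a statement copied into a named knowledge graph, with a
# provenance link back to the document it was sourced from.
from rdflib import Dataset, Namespace
from rdflib.namespace import PROV

EX = Namespace("http://example.org/")  # hypothetical namespace

ds = Dataset()
kg = ds.graph(EX["graph/team-kg"])  # one knowledge graph among many

# The copied statement: Bob helped Alice on project A.
kg.add((EX.Bob, EX.helpedOn, EX.ProjectA))

# Provenance: this graph's copy was derived from the meeting minutes.
ds.add((EX["graph/team-kg"], PROV.wasDerivedFrom,
        EX["doc/meeting-minutes"]))

print(ds.serialize(format="trig"))  # named graph plus its provenance link
```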

Something related I’ve been working on lately is using LLMs to track the original source of some statement through multiple documents. If you have a collection of documents, several of these may say that Bob helped Alice on some project, and you may want to know which of these documents was the original source. What you do then is first construct a knowledge graph based on the set of documents. This is done by pasting a schema that describes the types of data you are interested in (for example, who worked on some project) into the prompt to the LLM, together with the document and a question asking the LLM whether it can find any data matching the schema in the document. Usually the document will also mention a source for the statement, which you can then add as well, to track the original source of the statement.
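
A sketch of the prompt construction just described; the schema contents and the way the LLM is ultimately called are placeholders rather than any specific API:

```python
# Sketch: build an extraction prompt from a schema plus one document.
import json

SCHEMA = {  # hypothetical: the types of data we are interested in
    "worked_on": {"person": "string", "project": "string", "source": "string"},
}

def extraction_prompt(document_text):
    return (
        "Here is a schema describing the data I am interested in:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n\n"
        "Here is a document:\n"
        f"{document_text}\n\n"
        "Return any data in the document matching the schema as JSON, "
        "including any source the document itself cites for each statement."
    )

# Each extracted record becomes an edge in the knowledge graph, and the
# cited source lets you walk back toward the original document.
```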

For videos on constructing and querying knowledge graphs through LLMs, I can recommend Johannes Jolkkonen’s YouTube Channel.

For more on SSOT for open data, I recommend searching for “semantic web provenance”.

4 Likes