Indexing public data on safe

https://tika.apache.org/

I’m hoping that as development of the safe network progresses there will be more talk of search on safe. Apache Tika apparently has many uses, including scanning the deep web for many different data types rather than just text files, since much of what we know of the web and its dark underbelly consists of far more than text. This could be useful for indexing different types of public data.
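As a rough illustration of what type-aware indexing means, here is a minimal sketch in stdlib-only Python. It is not Tika's API: `extract_text` is a hypothetical helper, and it only guesses the type from the filename and handles two cases, whereas Tika detects types from content and covers far more formats.

```python
import mimetypes
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def extract_text(filename: str, payload: bytes) -> str:
    """Hypothetical helper: return indexable text for one public file."""
    # Assumption: the filename extension is a good enough hint for the type.
    media_type, _ = mimetypes.guess_type(filename)
    if media_type == "text/html":
        parser = _TextOnly()
        parser.feed(payload.decode("utf-8", errors="replace"))
        parser.close()  # flush any text still buffered by the parser
        return " ".join(parser.parts)
    if media_type is not None and media_type.startswith("text/"):
        return payload.decode("utf-8", errors="replace")
    # Unknown or binary type: nothing to index in this sketch.
    return ""
```

A real crawler would swap the two branches for a proper extraction library, but the shape stays the same: detect the type, extract text, hand the text to whatever builds the index.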

@Tim87 has made a previous post and written up an RFC on indexed structured data that is a good read: Indexed Structured Data. I am mainly posting this to try to evoke more discussion on the topic from others more technically adept than I :stuck_out_tongue: and to share what info I find that may be relevant.

I’m curious to see what the end solution will be. Will there be a specialized data type? Will we be crawling, accumulating, indexing? Whatever the case, the safe network is the platform to take the next step, so I feel that semantic search should naturally also be a huge part of the search discussion. If this post is premature, feel free to bury the thread, but please share any thoughts!

8 Likes

Me too! The nexus of so much of what we do in life and expect from tech, especially in terms of core empowerment, comes down to search. It’s at the nexus of AI, awareness, goal-seeking behavior, education, research, better legal results, and the rate of cultural and societal adaptation. It’s become the interface for much of what we do in society and primary to the way people understand the world.

Removing the ability of money to taint and twist search, and removing the conflicts of interest, is vital. It can’t be sponsored search much longer; it’s a deep irony that it ever was. It’s the difference between a society based on misinformation and one based on information. It’s the difference between a world where children are taught to value lies, fear and coercion and one where they are taught to value truth, trust and peace.

2 Likes

Ah cool. Thanks for posting this @nigel, I had totally missed indexed data there!

I’ve been toying with a search-y setup. My main thought is that anyone could maintain an index which anyone else could GET (and so GET rewards for maintaining a relevant index). Users could do this manually, set up their own crawlers, etc. Whatever they want. And then you could have a way of rating an index’s reliability on a given subject, etc.

These could be linked to others via keywords etc. and pulled down to the user, where the search would be done on the local machine (in my current POC).
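A minimal sketch of that idea, not the actual POC: it assumes each published index is a plain keyword-to-addresses map that has already been fetched, and that an index's reliability rating is a single float used as a weight. The `safe://...` names, `merge_indexes` and `search` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical shape of two indexes already pulled down with GET requests:
# keyword -> list of public addresses (the addresses are invented).
INDEX_A = {"music": ["safe://jams", "safe://radio"], "maps": ["safe://atlas"]}
INDEX_B = {"music": ["safe://jams"], "news": ["safe://daily"]}


def merge_indexes(indexes):
    """Combine (index, rating) pairs into one local keyword -> {address: weight} map.

    Each address gains the rating of every index that lists it, so results
    from better-rated indexes rank higher.
    """
    merged = defaultdict(lambda: defaultdict(float))
    for index, rating in indexes:
        for keyword, addresses in index.items():
            for address in addresses:
                merged[keyword.lower()][address] += rating
    return merged


def search(merged, query):
    """Score addresses for a query using only the local, merged index."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for address, weight in merged.get(term, {}).items():
            scores[address] += weight
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda address: (-scores[address], address))
```

Usage would look like `search(merge_indexes([(INDEX_A, 1.0), (INDEX_B, 0.5)]), "music news")`. The query string never leaves the machine; the only network traffic is the GET for each index.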

The advantage here is totally distributed search, maintained by no authority. Anyone can get in on the act. And your search doesn’t go to any service, so you maintain privacy over all your search history.

There are obvious limitations on scaling though, when everything is a GET request for data. But it miiight be a useful way of setting up some initial search capabilities. (just got to find the time to finish it :expressionless: :stuck_out_tongue: )

Something like @Tim87 is suggesting would be grand. Being able to offload the search burden to the network would be the ideal.

4 Likes

I can’t wait to see the POC finished! Truly distributed indeed, and something is better than nothing @joshuef!

Once compute is available for public data, that would open the door for developers to create their own search services, with indexes maintained by the network. That would likely require payment, though, for nodes to handle the computation.

That said, I like the idea of the network dealing with a specialized data type, as it would be truly distributed (no authorities), just like your POC, and could potentially be cost free. I wonder how this data type would be handled: would only high-ranking nodes (fast/reliable) handle the indexes, or would it be open to anyone and have similar scalability issues? Any additional thoughts are welcome :slight_smile:

1 Like