Brainstorming decentralized search on Safe

TylerAbeoJordan · September 19, 2020, 7:35am

I contend that decentralized search is MORE important than decentralized storage and that it is the key to censorship resistance on any network.

Today, while various platforms are censoring by blocking storage (e.g. YouTube), there are still many platforms that are open, and we are still individually able to create our own sites and post information on them.

The real problem facing humanity in terms of access to information is simply being able to locate it.

For the last twenty years Google has dominated this arena, and I believe that because it is controlled by humans who are all self-interested, Google, and really any publicly accessible cataloging system, is going to be manipulated in order to push the political and economic views of those who own and control it.

Hence I am creating this thread to discuss ideas for decentralized search on the Safe Network.

I’m skeptical that this could be implemented before beta launch, but perhaps we can come up with ideas and work toward some sort of hypothetical post-beta implementation over time.

davidpbrown · September 19, 2020, 7:45am

Some tasks cannot be fragmented?

A search base that does its work ahead of a user’s search is complex and cannot easily be shared; so decentralising it tempts duplicate work. Certain approaches would spawn differences in the results each node presents… but perhaps that’s not an issue if they are essentially the same… but if the nodes are learning from what they encounter, then they drift.

Natural data is hard to work with because it runs into the 80:20 rule fast; the devil is in the detail, and that is then a problem for decentralising.

It’s possible to create a base reference set and share that… but snapshotting an instance and putting it elsewhere is still centralised.

If you are talking decentralised, then you are talking many people doing the same task.
The task of search is work up front which cannot be fragmented if it is to consider what it sees relative to what it knows… because there is no collective knowledge.

The fallback is a variety of search engines… but where there is a lot of work, there tend to be few players.

Also, from the user’s perspective, they would want to trust… and will tend towards the one solution they know.

TylerAbeoJordan · September 19, 2020, 7:53am

My thinking is that a publisher could add a source link and tags to a source list either managed by the network or an app that is provably code-locked. And then the public will rate that information related to each tag … and that rating information would inform the order of search listings. The network or app would give some small reward for ratings while taking some small payment for generating search lists.
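A minimal sketch of that rating idea, assuming an in-memory stand-in for the source list. The class name `TagRatingIndex`, the 1 to 5 rating scale and the `safe://` links are all hypothetical, and the rewards and payments are left out entirely:

```python
from collections import defaultdict


class TagRatingIndex:
    """Hypothetical stand-in for a shared source list of tagged links."""

    def __init__(self):
        # tag -> link -> list of ratings received for that (link, tag) pair
        self._ratings = defaultdict(dict)

    def publish(self, link, tags):
        """A publisher registers a link under one or more tags."""
        for tag in tags:
            self._ratings[tag.lower()].setdefault(link, [])

    def rate(self, link, tag, score):
        """A member of the public rates how well a link fits a tag (1 to 5)."""
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        entries = self._ratings.get(tag.lower(), {})
        if link not in entries:
            raise KeyError(f"{link} is not published under {tag}")
        entries[link].append(score)

    def search(self, tag):
        """Return links for a tag, best average rating first."""
        entries = self._ratings.get(tag.lower(), {})

        def mean(scores):
            return sum(scores) / len(scores) if scores else 0.0

        # Ties are broken alphabetically so the ordering is deterministic.
        return sorted(entries, key=lambda link: (-mean(entries[link]), link))


index = TagRatingIndex()
index.publish("safe://boats", ["boats", "sailing"])
index.publish("safe://yachts", ["boats"])
index.rate("safe://yachts", "boats", 5)
index.rate("safe://boats", "boats", 3)
print(index.search("boats"))
```

Averaging is only one possible choice here; anything that resists a handful of raters gaming the order would need more thought than this.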

davidpbrown · September 19, 2020, 7:56am

I wonder if @david-beinn has been down the same rabbit hole a few times

but if a site provided a site index, trusted or whatever, how does a user receive a reply to their request?

Normally the user does not know which site they want… they know the search phrases they are looking for.

It’s the compound of search phrases that is unknown and too many to preempt. So, normally you would overlap result sets for known phrases within the search request.
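One way to picture overlapping result sets for known phrases, as a rough sketch. It assumes a hypothetical `index` dict that maps each known phrase to the set of sites containing it; the function names and the greedy longest-phrase matching are illustrative choices, not anything the network provides:

```python
def known_phrases_in(query, index, max_len=3):
    """Split a query into the longest known phrases, skipping unknown words."""
    words = query.lower().split()
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at word i first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in index:
                found.append(phrase)
                i += n
                break
        else:
            i += 1  # no known phrase starts here, move on
    return found


def search(query, index):
    """Intersect the result sets of every known phrase in the query."""
    phrases = known_phrases_in(query, index)
    if not phrases:
        return set()
    result = set(index[phrases[0]])
    for phrase in phrases[1:]:
        result &= index[phrase]
    return result


index = {
    "sailing boat": {"site-a", "site-b"},
    "second hand": {"site-b", "site-c"},
    "boat": {"site-a", "site-b", "site-d"},
}
print(search("second hand sailing boat", index))
```

The compound request itself is never stored; only the individual phrases are, and the overlap is computed at query time.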

TylerAbeoJordan · September 19, 2020, 7:57am

Additionally, I think the search itself could have various algorithms available for the user/searcher to select from, and these algorithms could be personalized by the user. Possibly the search algos themselves could be ranked by the system’s users on various metrics, and users could share their own.

TylerAbeoJordan · September 19, 2020, 7:59am

Tags supplied with the source link become the base of the source lists, which could then be searched against with the user’s keywords. Not functional for long phrases, but maybe tags could include short phrases.

davidpbrown · September 19, 2020, 8:08am

It’s a nice idea… until you try.

In theory the network could hold lists for each word and noun phrase, that are lists of sites and locations holding those.

The problem is that doing so means a huge number of small files… very small… you run out of inodes before disk space. That is then compounded by the number of updates.

So, my reservation is that I’ve yet to see that the network can handle small files. I wonder if each lump of data is 1MB… and that’s a stop on this idea. If the network can handle dust, then it’s worth thinking about. A site could sign up to a trusted dictionary, and the owner pays to see that site indexed. The index is then only as good as the people paying for it… and the cost of writing many files could be a problem, unless the network bakes this in for free.

davidpbrown · September 19, 2020, 8:36am

and just to note the other thought…

If it can be done better off network as a service, then is there any real advantage in being decentralised? Whoever is coordinating it might be one and the same regardless.

The difficulty is that I’m unclear on how to provide a service at all. So, a user makes a request; how do I receive that, and how do I reply?

TylerAbeoJordan · September 19, 2020, 8:49am

I don’t know the hard limits of Safe as it stands, so I don’t know. Perhaps if Search is an app in a provably locked account, then that app can aggregate enough tags into a file to find a balance between size and speed based on tests by the app coders. Perhaps there is also a fee to the publisher of new links to cover the costs of creating new files.
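As a rough illustration of aggregating tags into fewer, larger files: the sketch below groups tags into buckets by a short prefix of their SHA-256 hash, so the prefix length becomes the knob between file count and file size. The function names, the `prefix_hex_chars` parameter and the dict standing in for network storage are assumptions made for the example, not anything Safe specifies:

```python
import hashlib
from collections import defaultdict


def bucket_id(tag, prefix_hex_chars=2):
    """Map a tag to a bucket named by the first hex chars of its hash."""
    digest = hashlib.sha256(tag.lower().encode("utf-8")).hexdigest()
    return digest[:prefix_hex_chars]


def build_buckets(tag_to_links, prefix_hex_chars=2):
    """Group many tiny per-tag lists into at most 16**prefix_hex_chars files."""
    buckets = defaultdict(dict)
    for tag, links in tag_to_links.items():
        buckets[bucket_id(tag, prefix_hex_chars)][tag.lower()] = sorted(links)
    return dict(buckets)


def lookup(buckets, tag, prefix_hex_chars=2):
    """Fetch one bucket (one file read) and pick the tag out of it."""
    bucket = buckets.get(bucket_id(tag, prefix_hex_chars), {})
    return bucket.get(tag.lower(), [])


buckets = build_buckets({"boats": {"safe://a", "safe://b"}, "sailing": {"safe://c"}})
print(lookup(buckets, "boats"))
```

A shorter prefix means fewer, bigger files and more data fetched per lookup; a longer one heads back toward the many-small-files problem, so the right value would have to come from testing.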

EDIT: thinking about it more, I think it either has to be part of the network or we have to have a Safe-compute feature, which won’t come until post-beta, if at all. Even if code is locked on the network, it could still be modified on a user’s computer… I had a discussion about this here on the forum a while back (maybe with @neo).

I’d personally worry that if it’s off-network, it would come under attack and also I’m not sure if there would be any means to strongly prove results are not being manipulated.

davidpbrown · September 19, 2020, 9:10am

Yes, if you’re only trusting the network, then you have a problem for any service.

Content needs to be findable and we use what tools we have.

JPL · September 19, 2020, 9:32am

I would suggest borrowing what’s already out there if possible and adapting it for Safe. Have to say there doesn’t seem to be massive progress in this area, which, given the centrality of search, suggests it’s a pretty hard problem. As mentioned above, @david-beinn has been looking into this area so he’s undoubtedly looked into it further, but here are some links to related projects which might be useful:

https://cybercongress.ai/

jlpell · September 19, 2020, 11:28am

The patent for the original google pagerank algo has now expired.

JBishop · September 19, 2020, 2:32pm

The delimiter used in “cyber~Congress” is a ‘tilde’? Urban Dictionary:

tilde

A very positive general exclamation. Derived from the tilde (~) character on a computer keyboard. It stems from young computer users in highly excited states having a tendency to miss the exclamation point (!) key while typing, resulting in tildes becoming mixed with exclamation points in their writings.

Github:

cyber~Congress’ white paper says part of the genesis block CYB tokens will be awarded “as gift for Ethereum, Cosmos and Urbit communities”. Hacker News on “Urbit”, 9 months ago:

Urbit (urbit.org):

Some of Yarvin’s writings are at best eccentric, but as he has not been a part of the project for some time already, I think it’s fair for today’s contributors to the project to ask you to set aside his intent, or your impressions of it.

“Some time” is pretty vague. He left the project earlier this year, and the project is still named after his fascist blog.

I asked one of the Tlon employees in this comment section why the project is still named after Unqualified Reservations (UR), and so far haven’t gotten a response. I think it’s fair to expect a project with this history to address it head on and actually denounce the views of Yarvin/Moldbug if they are trying to be viewed as non-fascist.

A project can be apolitical, but a project with a political past doesn’t get to act like it’s apolitical just because the founder steps away on good terms and nobody else is blogging a racist and anti-democratic ideology. If the current contributors don’t agree with Yarvin/Moldbug’s intent, it seems fair to expect them to actually make that clear.

That bad? Not Yarvin as Yarvin:

urbit/old-urbit.org/community/articles - Welcome to Urbit

~tasfyn-partyv [Curtis Yarvin]

But wait - what the hell is Urbit?

One of Urbit’s problems is that we don’t exactly have a word for what Urbit is.
[…]
Not only is there no such word, it’s not even clear there should be one. And if there was, could we even hear it? As Wittgenstein said: if a lion could talk, we would not understand him.

Not that decentralized either, at Urbit.org they have:

Governance structures

The interim republic has four branches: an executive consulate, a galactic senate, a stellar congress and a planetary assembly.

For the interim, full authority is held by the (Roman style) consulate. The legislature (senate, congress, assembly) is advisory. The senate is never consulted and the congress is almost never consulted.

[…]

The congress of stars is designed to exercise project and community governance. Its proceedings are in private. The consuls convene it at their pleasure, and keep it informed.

“Galaxy table”:

95 galaxies are held by the Tlon Corporation. 50, reserved for urbit.org, the future community foundation. 40, by Tlon employees and their family members (24 by Curtis, who started in 2002; 16 by everyone else, who started in 2014). 34, by outside investors in Tlon. 37, by 33 other individuals, who donated to the project, contributed code or services, won a contest, or were just at the right place at the right time.

Note that while Tlon and its employees still control a majority of galaxies, the senate is a ceremonial body; the consulate is effectively equivalent to Tlon.

[…]

The first consuls are Galen Wolfe-Pauly and Raymond Pasco.

Wikipedia about those two and their Tlon Corporation:

Urbit

The Urbit platform was conceived and first developed in 2002 by Curtis Yarvin. It is an open-source project being developed by the Tlon Corporation, which Yarvin co-founded in 2013 with Galen Wolfe-Pauly and John Burnham, a Thiel Fellow.
[…]
The company has received seed funding from various investors since its inception, most notably Peter Thiel, whose Founders Fund, with venture capital firm Andreessen Horowitz invested $1.1 million in 2013.

Wikipedia-Urbit also has a paragraph on “Politics and controversy” and a few more about American far-right blogger Yarvin (1973), aka Mencius Moldbug, “often associated with the alt-right”, but can I do an alt-escape here? o/o

david-beinn · September 19, 2020, 6:01pm

Yeah, fully agree that search is a massive part of keeping things decentralised.

It’s not easy though, and as @davidpbrown suggests, it naturally lends itself better to a more centralised model in terms of efficiency.

There are various ideas over on the SAFE search app thread, but I’ve instigated most of that discussion, so feel free to look at it from a different angle.

I’m really busy with other stuff just at the moment so not much time to engage, and like I say probably worthwhile folk looking at it from other angles anyway.

Those links @JPL has put up look like they’ll contain some fairly interesting ideas too…

neo · September 19, 2020, 11:27pm

The concept I had thought of was that the publisher does the hard yards of spidering their site and via a standardized protocol they make this available to the search engines out there.

So then a search engine would take the site from the list of new indexes and build its own database, which obviously would index into each of the sites’ indexes. So one may cater to one style or language or ??? and another may want to be general.

Maybe that process can be done distributed with some reward for doing so.

davidpbrown · September 20, 2020, 6:11am

There’s no added value in a site holding an index that is only that site.

Search works because of the integrated indexing across sites. There are lists of all sites that contain a noun phrase, and then when a user searches, those lists are cross-matched for where a site contains all the user is looking for, or falling back to the closest matching… extended to couple with similes and conceptual similarity.

The publisher would just spawn a load of useless fragments… and could not do so without the knowledge base that is the search engine core… which is huge because of the natural language problem.
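A bare-bones sketch of that cross-matching with a fallback, assuming a hypothetical `posting_lists` dict from noun phrase to the set of sites containing it. It leaves out similes and conceptual similarity, and the function name is invented for the example:

```python
from collections import Counter


def cross_match(terms, posting_lists):
    """Return sites containing every term, else the closest partial matches."""
    terms = list(dict.fromkeys(terms))  # drop duplicate terms, keep order
    counts = Counter()
    for term in terms:
        for site in posting_lists.get(term, ()):
            counts[site] += 1
    if not counts:
        return []
    # Full matches are sites that appear in every term's list.
    full = sorted(site for site, n in counts.items() if n == len(terms))
    if full:
        return full
    # Fallback: the sites that match the most terms.
    best = max(counts.values())
    return sorted(site for site, n in counts.items() if n == best)


posting_lists = {
    "boat": {"site-a", "site-b"},
    "engine": {"site-b", "site-c"},
    "repair": {"site-c"},
}
print(cross_match(["boat", "engine"], posting_lists))
print(cross_match(["boat", "engine", "repair"], posting_lists))
```

The single-site case is visible here too: with only one site’s index loaded, every query either matches that site or nothing, which is why the value comes from combining lists across sites.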

There’s a simple case that is just single words but it’s misleading for the complexity that follows… a single word only search is literally trivial but it’s equally weak for user interest, in the face of a next step which is so much more effective… and further effective capability heads down a centralised path because complexity… it becomes orders of magnitude less efficient when distributed.

The only route out is volume… where having instances of some near AI cater for many, might work… with differing results. Still, that means nothing if the network is not thinking… people default to trusting that centralised owner… and making them owners added difference for little value… unless it’s covering off some other risk… or to manage the volume of requests.

As above atm, I don’t know how a centralised service of any kind, can receive a request and reply to it. So, rather stuck atm for any option to search.

jlpell · September 20, 2020, 8:56am

The only thing that can be viewed as a centralized entity is the Safe Network itself as a whole. The only accessible view of that server is the client’s Safe Drive, the public Safe Network, and what they can affect via the browser or cli.

I really thought you were on to something good with regard to storing indices at the hash of the search terms. In the case of single word search terms this matches up nicely with my proposal for a permanent public name reserve to eliminate domain squatting.

Alternatively you create a separate search index data type with its own 256/512 bit hash space so all possible search terms are searchable via their hash address. Creation of the search indices is done by grass roots safe crawling with some kind of proof of work to ensure accuracy. PtP rewards pay out index contributors equally based on their contribution when indices are read. Seems like a neat approach to use economics and human intelligence to do the natural language processing. A standard Safe Crawler app would allow for some automation and let anyone contribute their resources to crawling. Old-fashioned human readable lists might be a nice way to present the indices in the browser.
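To make the hash-address idea concrete, here is a small sketch that assumes SHA-256 as the 256-bit hash and a plain dict in place of the proposed index data type. `term_address` and `HashAddressedIndex` are made-up names, and crawling, proof of work and PtP rewards are not modelled:

```python
import hashlib


def term_address(term):
    """Derive a 256-bit address from a normalised search term."""
    normalised = " ".join(term.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class HashAddressedIndex:
    """Dict-backed stand-in for an index stored at each term's hash address."""

    def __init__(self):
        self._store = {}  # address -> list of links, in contribution order

    def add(self, term, link):
        entries = self._store.setdefault(term_address(term), [])
        if link not in entries:
            entries.append(link)

    def lookup(self, term):
        # Anyone who knows the term can compute the address and read the list.
        return list(self._store.get(term_address(term), []))


index = HashAddressedIndex()
index.add("Taj Mahal", "safe://india-travel")
index.add("taj  mahal", "safe://blues-music")
print(index.lookup("taj mahal"))
```

Normalising before hashing matters: without it, "Taj Mahal" and "taj mahal" would land at unrelated addresses.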

neo · September 20, 2020, 9:27am

The idea is that the site does the work for the search engine and makes the indexing available for all the search engines out there to take and build their indexes. That saves all the spidering effort, which is time consuming.

Then the site puts a link to its index data (following some standard) in an appendable data record that all the other sites add to as well. The search engines go through the list and process each site as they go.

This way the search engine’s work (before compute) is done by one or more machines processing the indexes the sites built.

Now a search engine could provide an APP that people can run that does some processing and be rewarded for it. That way it could be distributed. Or else the search engine provider has as many machines as needed to do the work.

Developing a standard for indexing a site allows the site owner to use an APP to do the work. There can be competing APPs to index the site and competing search engines.
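A minimal sketch of that flow, assuming a list-backed stand-in for the appendable data record and a site index format of term to page names. `AppendOnlyRegistry`, `SearchEngine` and the `fetch` callable are hypothetical, not Safe APIs:

```python
class AppendOnlyRegistry:
    """Stand-in for the shared appendable record of (site, index link) entries."""

    def __init__(self):
        self._entries = []

    def append(self, site, index_link):
        self._entries.append((site, index_link))

    def read_from(self, cursor):
        """Return every entry appended at or after position `cursor`."""
        return self._entries[cursor:]


class SearchEngine:
    """One engine building its own database from the published site indexes."""

    def __init__(self, registry, fetch):
        self.registry = registry
        self.fetch = fetch  # callable: index link -> {term: [page, ...]}
        self.cursor = 0     # how far down the registry this engine has read
        self.index = {}     # term -> set of full page addresses

    def process_new_sites(self):
        new_entries = self.registry.read_from(self.cursor)
        for site, index_link in new_entries:
            for term, pages in self.fetch(index_link).items():
                merged = self.index.setdefault(term, set())
                merged.update(f"{site}/{page}" for page in pages)
        self.cursor += len(new_entries)
        return len(new_entries)


published = {"safe://boats/index": {"boat": ["home", "sale"]}}
registry = AppendOnlyRegistry()
registry.append("safe://boats", "safe://boats/index")
engine = SearchEngine(registry, published.__getitem__)
engine.process_new_sites()
print(sorted(engine.index["boat"]))
```

Each engine keeps its own cursor, so competing engines can read the same record independently and merge the site indexes however they prefer.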

davidpbrown · September 20, 2020, 10:17am

That’s little added value… none for engines that will just do their preferred spider themselves regardless. They will want that for the advantages of being better than any old spider protocol.

The work that makes a search engine tick, is still centralised.

Edit: also this entirely underestimates the reality of search beyond just single words and a very limited number of phrases. The app would be calling on a huge file or a huge number of small files… which if they are not on network will be a significant problem.

mav · September 20, 2020, 11:05am

Just throwing stuff at the wall here…

Search does two things - forms a subset of relevant pages, and orders the pages.

The first part, excluding unrelated pages, and the second part, ordering the pages, are probably not that different in practice since if you don’t look at result 1038 then it may as well have been excluded.

But I think the distinction is useful when considering how to approach search in an incremental way. I think the first part is probably easier than the second part.

The ordering consideration probably means the search will benefit from some additional context to help give the most relevant results for your particular situation. Google uses the example of searching for taj mahal for why context matters. “the perfect search engine should understand exactly what you mean and give you back exactly what you want”. I think both sides of the privacy debate have a strong case here…

Very early Safe Network search could be as simple as “ask your friend who’s really into boats which pages I should visit if I’m interested in buying a boat”. If the friend has appropriate knowledge and understands your situation, and they’ve read enough on Safe Network about boats, they can hopefully provide you some useful pages.

Maybe that’s a basis for the early search, have people who know where the good pages are on the network and keep track of content really closely, building a really human-level mapping of stuff. Almost like news aggregators, that kinda thing.

Then it can be extended to pagerank style automation and ranking, which is the same thing but automated and scaled up. It’s still more like asking a friend, since the results of that ranking are presumably stored locally and not on Safe Network and we have to actually ask the friend (ie the friend would suddenly know we’re interested in boats). At a basic functionality level there’s not that much difference between asking Tony-the-boat-friend about boats and asking Google about anything. Both involve asking someone more knowledgeable.

The hard part of scaling up will be improved ordering from added context. Google can do this automatically because of how much they already know about each person, from their searches but also from their Gmail, Maps, IP etc. On Safe Network this context will need to be supplied for each search. Is the search query itself context? In some way yes, but it’s such a strong piece of context that it probably deserves to remain conceptually distinct from the other stuff.
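For reference, the core of a pagerank-style ranking is a short power iteration. This is a textbook-style sketch rather than Google’s production algorithm; the link graph, the damping factor of 0.85 and the iteration count are assumptions for the example:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages from a dict of page -> list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Pages with no outgoing links spread their rank over every page.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n for page in pages
        }
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


links = {
    "boat-reviews": ["boat-market"],
    "sailing-blog": ["boat-market"],
    "boat-market": ["boat-reviews"],
}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Whoever runs this needs the whole link graph in one place, which is part of why the result behaves like a knowledgeable friend’s private notes rather than something living on the network.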

My feeling is someone will probably design a standard format for supplying context (including techniques for privacy etc) and this context-stuff will be managed automatically by some personal-assistant type algorithm that runs on your machine and ‘gets to know you’ and can fill the search context automatically each time.

But I’m not sure how to take the final step and put the search index data into Safe Network itself. It seems to me that we will always be finding better ways to ‘ask a friend’ to tell us some of their ‘knowledge’. You could just say Tony-the-boat-friend uploads his best boat links once a week, and Google could do the same for all searches, but I’m not sure how practical or desirable that would be.

It’s also worth asking how we will submit a search query on Safe Network. Would it be like sending an email to Google, with Google responding with another email containing the results? Most likely it will be direct p2p (as in my computer to Google’s computers), so we really do end up just talking to Google almost exactly like happens now.
