SAFE Search App

JPL · December 4, 2019, 1:31pm

Definitely! The technicalities are beyond me, sadly, but some sort of search/discovery engine is an absolute essential IMHO. Great to see someone grasping the nettle.

david-beinn · December 4, 2019, 1:38pm

Thanks, I’m pretty light on the technicalities myself, just trying to do it as simply as possible and see how far I can get!

Josh · December 4, 2019, 3:48pm

That is how it all begins.

joshuef · December 4, 2019, 6:12pm

hey @david-beinn! This is great to see. I’ll be happy to help out where I can.

Re: reserved tag type. I’m not sure we need one. We can just use a type tag and claim that as our own for search indexes. Either way, the app will still need to be checking that the data isn’t corrupt, as who knows what app PUT data at that type tag anyway.

Re: Pweb. Aye, I don’t think ‘on this date’ will be possible purely by pWeb. But the indexing app could apply timestamps which would make that possible. (though how/what to trust is a different Q). Versions are normally just as you say, current -10million or what you have. Just a list of new versions of a sites data.

Some thoughts:

The approach of using browser data is great. It doesn’t have to be restricted to bookmarks/history, though that’s a great start. I kind of imagine everyone’s browser doing some indexing as you go.

Down the line pulling this info from trusted contacts will be a nice way to augment results.

I think the most effective way of doing this will be for folk to dedicate some resources to doing the indexing of all the data you highlight (bookmarks, history, site info) to produce an index, which can be shared…

I imagine this happening on the individual level at first. Maybe a web app to ‘make my index’ from info produced by the browser.

You could then have your octo search page which will use your own index by default.

At the same time, we if reach an agreement about how a ‘site info’ schema might look, this could be applied automatically as part of site creation tools. (Either maidsafe, or external tools… the important thing is just creating a draft standard to improve upon). Which we can hopefully make the generated indexes easier / simpler.

With that, you’d have a basic form of ‘search’. And then it becomes a question of enabling using other indexes. And (i think) the ability to reference other indexes. So folk can start really trying to produce a ‘good’ index which is consumable by octo (or any other search app)

intrz · December 4, 2019, 6:12pm

In the first implementation you have the site send the index, maybe it could continue to do? When updating a site the site owner would send an updated index for the site with a timestamp and version to larger indexes/search engines that subscribe to the site. Then there would be no need for crawling on a regular schedule as it would be the responsibility of the site authoring software to send the updated index when the site is updated.

Then the index has a version and a timestamp, so you could search current version by time or just some specific version of the site.

david-beinn · December 4, 2019, 7:30pm

Thanks for thoughts, @joshuef, @intrz.

No longer so concerned about spam, as my thinking here is that the ‘Octopus’ account would be the only one with write permissions on any of the ADs.
My concern would be that it would be easy to squat an AD because the formula for its address is fairly crucial. The search app needs to know that it can go straight to the hash of the search term and find what it needs to there. If someone wants to take ownership of the ‘Apple’ AD before Octopus does, there is nothing stopping them.

I guess the question is whether time-stamping or the approach I suggested is more efficient. Is there a way for the user’s browser to quickly search for the version with the required time-stamp?

I’m really intrigued by this approach of everyone keeping their own indexes, which I think you’ve mentioned before, but it’s not what I had in mind here!

Genuinely open to suggestions of how that could work at speed/scale though, because that would be a more decentralised solution.

To explain the approach I’m suggesting here better though, the idea is that just the ratings for any given page are taken from the consenting client’s accounts.
The bones of the index are generated from site owners asking (and paying for) the central ‘Octopus’ apps to index or re-index their page, based on their mark-up. My hope would be that eventually this centralised computing would be done by the network itself.

However, as soon as we talk about scale, this indexing cannot be completely objective unfortunately, so Octopus has to choose the terms of the ratings, even if they are generated from browsers’ data. The only thing that would mitigate this is to make sure that process is very transparent.

(eg. search for ‘Apple’ and Google returns about 8 billion results. The only way I can think of to handle that using the browser’s computing ability is to have them all ready to go in ranked order at a location on the network, so the browser can just return pages held in the top rank key slots. This is where it gets speculative and my knowledge completely runs out - Can we mount a very efficient data structure on an AD that allows also for insertion and update of entries, whilst maintaining things in ranked order? eg, red black tree, adaptive radix tree.

Honestly, I love this sort of ‘chinese whispers’ approach, I just can’t begin to see how it could quickly summon the quantity and quality of information we’re used to, but hopefully I’m missing something??

Feel like my approach is somewhat disappointing in comparison!

If I’m understanding right, I think this is what I had in mind. The crawling (or scraping) would only be done when the site owner initiates it. In some circumstance this may be on a regular subscription basis.

The regulated schedule is purely for the purpose of keeping track of time, and happens in the process of creating the inverted index (timed) from the info about each site (not necessarily timed.) In that scenario, site owner would say, I want my page indexed right now, Octopus would scrape the page for keywords etc, calculate the cost of indexing the page, then on payment add it first to the list of pages, including full info about it. Another Octopus app would then read that info page when the next minute ticked around, and deposit its address, (and some ratings info) to every relevant Keyword AD. There would be no need to touch this info page again unless the site owner asked for the page to be re-indexed, or if the ratings changed. Ratings would likely be updated on a much slower schedule than everything else, but I haven’t thought too much about that yet.

isntism · December 5, 2019, 6:03am

I’m not convinced that we’ll get a google-level search system going on the network without servers with billions of results, instead we can start focusing on individual websites, who needs all those results anyway, few people stray from the first page.

As for the chinese whisper aproach, can’t you just reference other indexes, which then get crawled too, before doing whatever search engines do to databases to get results.

That would even fix the “who pays for it” problem, if a site is usefull enough someone is going to put it in their own index, which is referenced in others, there are plenty of people who’d do that, it would make it difficult to keep the quality up though.

Especially if someone starts mislabeling websites, one step at a time though.

ps. I did have trouble understanding your graphs.

david-beinn · December 5, 2019, 7:57am

Hi @isntism, interesting thoughts, thanks.

Sorry the diagrams are not so clear. They were clear in my head, but I suspect you’re not the only one who’s not picking them up so quickly!

On one level I would agree that we don’t necessarily need the billions of results Google gives us. However, I do think it is a good idea to try and index as much as possible to enable more obscure searches, and make sure less popular sites are not completely excluded. Ideally I think you should always be able to find something if your search terms are sufficiently specific. On that basis I think we quickly get into very large numbers.

Whether what I’m suggesting is the right way to deal with that though, I’m not so sure at all, which is why I’m really interested in the approach that @joshuef is hinting at.

What I would say is that I think on the payment side, mine works quite well as a model. I would argue that it is people who have built a web page who have the incentive to pay a little bit extra to make their site discoverable, and on the whole it really would be a small amount extra. Shifting the payment to the searcher side would be very difficult to pull off I think (but not impossible)

However, I’m starting to understand a little better perhaps how that model could work.

I guess the idea would be that when I come across a decent page I hit the ‘index this page’ button, which adds it to my own index, which is in the same format as the index in my diagram, with a set of AD s for keywords (inverted index) and one with the full details of the sites. That would also put a reference to my index in the more centralised inverted index for every keyword. The only problem then I suppose would be how do we order those references? How do we know whose index turned out to be most useful? Alternatively we could just traverse our neighbours’ indexes. But how do we know who our neighbours are, and whether they might be more or less likely to have something for the term we’re after?

Would this be a correct interpretation of what you’re suggesting @joshuef?

I think I’m slightly less sceptical of the possibility of having a comprehensive search system. The XOR address look up is a pretty powerful thing from what I can tell, so if we can just narrow things down to being in the right place, I don’t see why we can’t get to them pretty fast. Again though, that comes with the caveat that I’m no expert.

joshuef · December 5, 2019, 9:04am

Aha, right. I hadn’t really understood this was the strategy.

No, there are no timestamps in versioning. That’s why iIwas suggesting it could be up to the app if that’s desirable

Thanks for clarifying your approach. I was definitely missing some of that.

There’s not so much different in the approach, I think. Search term ADs are a similar concept. I think maybe a difference is in that instead of Octopus/the network being the sole arbiter of the indexes. We could have many competing indexes, as long as the output is standardised.

So then we don’t need reserved AD, any indexer can create / reference the data they need. And they would be producing searchable index, or a portion of one.

Thus there’s less need to be completely unbiased, as anyone can switch out their desired central ‘search’ index, so competition would (hopefully? perhaps?) force more transparency.

It means the onus is on the indexer for indexing (though I do like the idea of site owners contributing, i think that’s great), paying for computation / data etc. But they also own that data. Which means they can get rewards for it too.

Anyway that’s the crux of my thinking. I really think it’s not sooo far away from what you’re talking about here.

joshuef · December 5, 2019, 9:08am

Yeh i think this is important to keeping things decentralised.

It could also mean that knowledgeable folk, could create very niche, but very informative indexes (and be rewarded for them), and that other larger indexes, doing things in a more automated fashion could rely on this too.

As the network grows I agree some form of computation will be needed (perhaps with a server in the interim), but that searches don’t need to leave the network per se. Once we have messaging etc comms with a large index for fast results should be easy within the network, so we may still be able to get such services (and perhaps with micropayments etc for their time.)

joshuef · December 5, 2019, 9:12am

I think web of trust stuff will become useful here. (Project decorum eg). But at a simple level you could go off our own contact list.

Knowledgeable folk with high reputations will appear in all areas over time.

Aye, this is where I’m leaning myself

I dont think this has to be deterministic. You can choose to use a given index which would point to the data, so you just have to download the index in advance to get the jump start on finding your search terms.

david-beinn · December 10, 2019, 1:50pm

Hi @joshuef, thanks for all the points. Sorry for the slow reply, been applying for a creative writing course, so my head’s been in a slightly different place!

However, I’ve been thinking about how to make a fully decentralised search like you described work, and I keep coming back to similar sticking points.

For all it is obviously preferable to remain decentralised, I feel like in this case the downsides of pursuing that would outweigh the benefits even if it was successful, and in the scenario that it wasn’t successful, it would open the door for a fully centralised solution like Google to come in and take over, and then in many ways we’re back to square one.

Firstly, it’s clear we have to correctly incentivise people to participate in indexing, and I find it very hard to see how that can be done whilst maintaining trust in the index. We have to reward both quality and quantity, and making it so neither can be gamed is very difficult. It feels like we would end up with a very complex system that it is constantly trying to stay one step ahead of the scammers.

The dream is I suppose that one builds an index for one’s own use, and then shares it, but I find it difficult to see the attraction for anyone to do that. Your own index is by definition what you have already visited, and unless it is held locally (which would defeat the object) has no speed benefit either. Assuming it is based on indexing sites we like, that only really allows for a binary vote of like (or not bothered,) which is fine if we are aggregating the results into a centralised database, but is not so much use if we are trying to just traverse.

I think it would be complicated to do it this way on a technical level at any larger scale, so we would end up with a very complex system, quite probably with a poor UX, that then runs up against the same insurmountable conceptual problems in the end.

I love the idea of introducing more human values in the way people are recommended and referred to sites, but I think the proper place for this is a more social level overlaid, such as you mention Project Decorum. Someone will fill the role of a big, Google style search, and I think it would be preferable to get in there and do it in a way that is as in keeping with the values of SAFE as possible, even if not idealistically perfect.

The actual computing that the ‘Octopus’ apps do in the way I’ve drawn it is very simple, and would be very amenable to decentralised computing, so the site owner would eventually just be asking a vault to index their site for them. It’s obviously been hard enough building a trustless network, so we might as well use it!

In addition, just in case it’s not clear, in the way I’ve drawn it, all actual searching is done within the browser, so that’s decentralised.

The biggest controversial ‘centralised’ area would be how rankings are calculated, but if this was completely transparent that would help, and if the index was open to anyone to copy, or mirror in real time, that would allow for the ‘seeding’ of alternative approaches. A more powerful search dashboard would also help, so that if one was willing to wait a few more seconds, results could be ordered on different metrics.

I was referring to the way I was intending to use the SAFE address look-up, which is very pinpointed if I understand correctly. If we go to the address which is a hash of a search term, the next door addresses won’t be related in any way. This is vital for speed in my design. A tree structure would allow more fuzzy look-up from what I understand, but might as well try to use what’s already there!

I’d imagine that a version of this in the way you’re describing would be that your index would actually just be a formula: eg. Hash of (Search Term) + Josh, to save iterating over lists of search terms, addresses etc.

Again, not necessarily any speed benefit in this for you compared to the big index.

Apologies if it sounds like I’m digging my heels in. Still happy to be persuaded, and like I say I want to get a consensus because any solution will need support going forwards, but I feel like with the fully decentralised version we get on shaky ground as soon as it gets to specifics.

joshuef · December 10, 2019, 3:04pm

No worries! More writing for more folk. I need to get back into such things myself tbh!

Some thoughts:

Quality reward should come from GET rewards as part of pay the producer IMO (yeh I still hold out hope for this idea). So that’s not something any search system should worry about IMO.

What would the scammers be doing? Bad indexes won’t be used. If they are making useful indexes then they aren’t scammers?

Bookmarks are an example of this.

I’d also imagine there to be some kudos about having the best index on a given topic. You only have to look at StackOverflow to see how far community driven ranking can take something.

Hmm depends really. If it’s your data a browser could augment an index with frequency of visits… time spent etc.

It would be useful to understand what complexity you’re seeing here. But I assume some folk will combine / make the effort to properly index, all these sub / specific indexes. And then you could be periodically pulling these multi-indexes locally, if you want to run searches locally, or querying larger systems for results.

What keeps it open is that the indexes themselves remain open (in order to get rewards), so building a competing index gets a lot easier… And the index you’re using becomes detached from algo used for searching or indeed UI.

There will likely be some big players. But I’d imagine (perhaps naiively / optimistically), that the incentives for this GETs or paid search make it quite competitive, while keeping barrier to entry low, with public indexes (and folk also being paid for their use, mind)

No apologies needed! I should say the same and that I’m happy to be convinced too.

If I understand Octopus, it’s essentially:

Indexing is a paid for activity. (paid worries me: how do i know ranking is being bought?)
Indexes are made public.
Indexes are pre-ranked (by some ‘transparent’ method ? I’m not sure what/how this would be transparent or guaranteed to have been applied?)
Search is hash(term) → index for that term?

Please correct me if I’m missing something, or indeed plain wrong here.

In general I think any system will be stronger if Indexing is separate from ranking. And ranking is separate from search. That way it doesn’t matter if the index is voluntary / auto-compiled from browser + web of trust or paid paid for.

The algo you could choose something which is a pre-parsed index with your desired ranking algo applied (could be paid for too). Or it could be download and search locally with your own aglo also.

Key things being: indexes can be easily made / built upon. Other’s indexes (thus giving rewards) garnering rewards from PUT.

Ranking could be a service or done on the fly.

Searches dont have to go to a third party. But that could be desirable for some searches.

(Sorry if I’m repeating myself there… just trying to get down some vague ‘tenets of search’, that might help us qualify an approach)

happybeing · December 10, 2019, 3:18pm

I think we can have a good go at this by focusing on baby steps. We’re not aiming for Google, but a path towards building structures that make it easy to find stuff. So forget search engines and look back at how web search evolved.

I wrote something about this a few weeks or months ago but can’t find it now, but I recall that it was something along the lines:

people would find things by various means (word of mouth, lists of lists, curated topic lists etc). Individuals or organisations created lists for others to find things on their sites, these were shared and lists of these became bigger indexes, and it was fun collecting these and building bigger and better lists. Enough people will do this with no incentive except the desire to create. So that’s stage one - simple random sharing of lists made by others. We still have this today with stuff the search engines don’t index. This is a great opportunity for app developers, to build things that help people create and share lists in formats that other apps can consume, and in turn generate nice websites at the push of a button.
more curated indexes. I remember there was one semi-official list of websites by category, sub category etc etc. It may still exist but I don’t recall the name. Anyone wanting to make their website discoverable would submit a listing and someone would review that and hopefully add your website to the big ‘official’ list.
alongside that came simple crawler based search, AltaVista and I think Yahoo! More centralised, but still many engines, gradually improving and growing until Google invented the ‘page rank’ algorithm and the monopoly struck. This made all the other approaches obsolete and most engines lost almost all their traffic to Google and here we are.

With SAFE we are back at square one, so let’s start with that and assume enough people will just build and publish their own lists. Others will curate them, and curated lists will be curated and repackaged in various forms including searchable indexes, and no doubt there will soon be crawlers and even some clever ‘page rank’ stuff.

Now we can do better than this, but I think it’s a fair assumption that we’ll have the above to build on. We don’t need to incentivise everyone, but we can build engines that reward those who do this work. Just don’t reward them to the extent it is worth gaming.

That’s where I’d start. Build tools to help these processes, and to help pull indexes published in say, RDF together and will get more tools and new ways to do this without a monopoly.

Find ways to include the web of trust for topic based indexes, topic blacklists, topic whitelists or ranking etc and things can improve further.

Focus on things that promote decentralised control and wider distribution of rewards to reduce incentives towards monopoly and corruption.

Now I see Josh has also replied, TLDR: what @joshuef said!

Oh and just so it doesn’t get lost:

david-beinn · December 10, 2019, 3:50pm

That explains quite a lot! Perhaps it all hinges on that.

Not the best choice of word maybe by me, but I was thinking of falsely rating sites, for other benefits. Possibly opens up a market for bribery? Not quite so confident as you that bad indexes just wouldn’t be used. Clickbait apparently pays now, so why not on SAFE?

Definitely agree this can be used, but I’m thinking of those pages that are useful, but nobody exactly wants to bookmark, which arguably constitute the majority of our web browsing.

I’m just thinking really in terms of aggregating decent results in an acceptable amount of time. Questions like how do we know what group of indexes to belong to, or assign an index to? How to select the appropriate group to search first? If someone is aggregating them professionally, then surely we end up creeping back towards centralisation? How to avoid the web being completely segregated into bubbles? How do we know when to stop searching? Are the best results going to have been returned in that time, especially for obscure queries? How do we quickly disseminate something such as a news story? (the speed factor is probably why Wikinews never worked.)

On the plus side though, you’re maybe right that algorithms might be less necessary.

Do you mean NOT being bought?

Either way, anyone would be able to look at the site info AD for any given site, and see how the ratings metrics were changing.

Different types of ratings info would be coming in from search users, and it would be possible by sampling to check that this was being applied how Octopus said it would be to the eventual rankings.

In a broader way, the idea would be that Octopus would operate in the way that eg. a non-profit does - It has a mission statement that it adheres to. If it deviates from that we can see what’s happening, and in worst case scenario all the indexes and algorithms are open source anyway, so someone else can pick up where it’s left off.

To read, but not write.

The idea would be that they would dynamically be ranked, based on the various ratings metrics in the diagram. This is where it gets difficult technically, but no reason it wouldn’t remain easily verifiable.

That’s right, eventually with some kind of middle stage of stemming/search expansion based on the same principal.

I’d agree, and this is where I started off from. When we get to big numbers though, it seems the only efficient way to pull something decent out of the hat, without downloading it or asking a centralised super computer to do the work, is to keep it in some kind of ranked order. If we allow say 50 bytes per entry, an index can very quickly get too big to be realistically downloaded (hence the limit on my intermediate level design.)

david-beinn · December 10, 2019, 3:53pm

Yes, this would be a good idea I think. I feel like all approaches are compromised, and it makes sense to think of what compromises are most acceptable.

david-beinn · December 10, 2019, 4:13pm

Hard to disagree with any of this Mark.

My only point would be that people obviously have very high expectations these days, and the reason why Google established its monopoly was because they did things in a way that people preferred at that time, and may well prefer again.

My thinking was that by fixing some of the most broken aspects of the Google model (which is not too difficult) but keeping some of the aspects in a different form, we can tick both boxes.

I worry that if we are too idealistic we make the network vulnerable to being defined by the terms of another giant.

happybeing · December 10, 2019, 4:32pm

I agree and don’t mean to point you in any other direction. My approach was to suggest that we can start simple and move towards your goal rather than try to tackle the bigger issue from the start.

So when you were puzzling over how to reward and keep that from being gamed, I think to start with you need not worry about rewards at all. And if Pay the Producer is implemented it might well be enough anyway, at least for now, as Josh suggested.

Maybe we can tackle this at both ends and meet in the middle! Keep at it. Genius is about sticking with problems much longer than anyone else (according to Einstein, and I think he’s a reliable source ).

david-beinn · December 10, 2019, 4:43pm

Thanks Mark.

The funny thing is my whole approach came not so much from trying to tackle the big problem as saying ‘where can I go next,’ and then suddenly realising with how easily you can generate an address from a search term, this very basic design had the potential to be fairly scalable.

I’ll keep thinking though, and see if I can come up with any different ideas. I probably won’t have the chance now until after Christmas to sit down and actually make anything anyway.

joshuef · December 10, 2019, 6:07pm

Fair point and good to be wary of.

How is that not what octupus services are doing w/ paid entry into indexes? I’m not clear on the distinction here.

Something that may be worth thinking on is what is ‘good enough’. Google is fast, and mostly returns relevant results. but not always. We’re never going to be perfect. But how far can we get while maintaining acceptable user experience? (acceptable results and decent query time. Google is nigh on instant… but it wasn’t always. Maybe folk will accept a few seconds to query for their independence? )

Good questions.

Yeh, not being bought, or being able to tell if it is being bought (and so to discard that).

But does that give them a clear view? Does the indexer not have an incentive to pretend to be unbiased but actually allow paid results… Look at google w/ ads that look so similar to actual results. Or indeed newspaper advertorials. It gets slippy.

(That said, you have 0 way of knowing this for ‘curated lists’ either. So maybe this is moot for now)

Thanks for the latter clarifications. Especially around the Octopus non-profit stuff. That helps some with the ideas behind I think.

It’s late for me know. But aye, I will keep stewing, @david-beinn! , you’ve given me some good things to think about here!

Topic		Replies	Views
Indexing public data on safe Apps	3	1071	January 16, 2017
Safe Search 2023 Apps	12	624	February 11, 2024
Ideas for SAFE Search? Features	3	617	April 2, 2018
Free Targeted Advertising for Safe Network until June 2021 on PRESEARCH Marketing	24	862	March 7, 2021
SAFE web Standards Marketing	1	674	February 25, 2016

SAFE Search App

Related Topics