Description
I have begun development of a SafeNet crawler and search indexer, which will form the backbone of a SafeNet search engine application allowing users to discover public content.
Motivation
Allowing users to search for content will be a major benefit to the SafeNet, especially with the impending release of Safe-CMS: Safe-CMS enables content creation, and this project will enable content discovery.
Progress so far:
We have created a basic web crawler which follows the links it finds on web pages to discover other pages, domains, etc. At the moment it's crawling the regular HTTP internet (prototyping it this way was simpler while we work on CLI libraries for interfacing with the SafeNet). This is rudimentary, of course, but it's a good first step. Here is a little demo:
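For the curious, the heart of a crawler like this is a small fetch-parse-enqueue loop. The sketch below is a hypothetical simplification, assuming the `reqwest` (with its blocking feature) and `scraper` crates; it is not our production code:

```rust
use std::collections::{HashSet, VecDeque};

use scraper::{Html, Selector};

fn crawl(seed: &str, limit: usize) {
    let mut frontier: VecDeque<String> = VecDeque::from([seed.to_string()]);
    let mut visited: HashSet<String> = HashSet::new();
    let links = Selector::parse("a[href]").unwrap();

    while let Some(url) = frontier.pop_front() {
        // Stop once we hit the page limit; skip anything already crawled.
        if visited.len() >= limit || !visited.insert(url.clone()) {
            continue;
        }
        // Fetch the page, skipping it on any network error.
        let body = match reqwest::blocking::get(&url).and_then(|r| r.text()) {
            Ok(body) => body,
            Err(_) => continue,
        };
        // Extract outbound links and push unseen ones onto the frontier.
        for a in Html::parse_document(&body).select(&links) {
            if let Some(href) = a.value().attr("href") {
                if href.starts_with("http") && !visited.contains(href) {
                    frontier.push_back(href.to_string());
                }
            }
        }
        println!("crawled: {url}");
    }
}
```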
In addition, we've drawn up an application diagram detailing which software already exists and what we still need to build. In simple terms (though we've privately gone into much more depth), we need to build the following:
- A multi-threaded SafeNet crawler which generates and stores a structured graph of nodes representing the SafeNet pages, with directed edges representing links (familiar to those with experience of graph theory; a sketch of this structure follows the list)
- A process which filters, prioritizes, and generally manages the queue of uncrawled URLs, as well as maintaining internal lists of recently crawled, redundant, and malicious URLs
- A process which takes the structured graph and calculates a "page-rank" for each node based on its content, page-rank inherited from in-edges, etc., and stores the search indexes on the SafeNet as mutable data, to be consumed by:
- An app which downloads the compressed indexes (as needed) and allows searching SafeNet websites locally (also sketched below)
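To make the graph and page-rank items concrete: below is a hypothetical in-memory shape for the page graph, plus one iteration of the classic PageRank recurrence with damping factor d. Our real scorer will also weight page content, and the real graph lives in external storage rather than memory, so treat this purely as a sketch of the idea:

```rust
use std::collections::HashMap;

/// Directed page graph: each node id maps to the node ids it links to.
type Graph = HashMap<u64, Vec<u64>>;

/// One iteration of the classic PageRank recurrence with damping factor `d`:
/// rank(n) = (1 - d) / N + d * sum over in-links m of rank(m) / out_degree(m).
/// `ranks` is assumed to hold a value for every key in `graph`.
fn pagerank_step(graph: &Graph, ranks: &HashMap<u64, f64>, d: f64) -> HashMap<u64, f64> {
    let n = graph.len() as f64;
    // Every node starts with the baseline (1 - d) / N ...
    let mut next: HashMap<u64, f64> = graph.keys().map(|&k| (k, (1.0 - d) / n)).collect();
    // ... and each in-edge adds a share of the source node's current rank.
    for (&node, out_edges) in graph {
        let share = d * ranks[&node] / out_edges.len().max(1) as f64;
        for &target in out_edges {
            *next.entry(target).or_insert((1.0 - d) / n) += share;
        }
    }
    next
}
```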
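And for the search app itself, the rough idea is that a downloaded index shard maps terms to ranked posting lists, and a query intersects those lists. Again, this is a hypothetical sketch; the actual index format is still being designed:

```rust
use std::collections::HashMap;

/// Hypothetical shape of a downloaded index shard:
/// term -> list of (url, precomputed rank) pairs.
type IndexShard = HashMap<String, Vec<(String, f64)>>;

/// Keep only pages matching every query term, ordered best-ranked first.
fn search(index: &IndexShard, query: &str) -> Vec<String> {
    let terms: Vec<&str> = query.split_whitespace().collect();
    let mut hits: HashMap<String, (f64, usize)> = HashMap::new();
    for term in &terms {
        for (url, rank) in index.get(*term).into_iter().flatten() {
            let entry = hits.entry(url.clone()).or_insert((0.0, 0));
            entry.0 += rank; // accumulate rank across terms
            entry.1 += 1; // count how many query terms matched this page
        }
    }
    let mut results: Vec<(String, f64)> = hits
        .into_iter()
        .filter(|(_, (_, count))| *count == terms.len())
        .map(|(url, (rank, _))| (url, rank))
        .collect();
    results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    results.into_iter().map(|(url, _)| url).collect()
}
```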
Stuff which already exists and will be utilized:
- The queues will be simple priority-weighted key/value Redis instances (see the sketch after this list)
- The storage for the graph structure will initially be MySQL, but once complexity increases (and the size of the index ramps up as the SafeNet becomes more popular) we'll likely look at a more suitable tool; we're trying to minimize complexity for the initial release.
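To illustrate the Redis queues mentioned above: a sorted set makes a natural priority-weighted frontier, since ZADD attaches a score to each URL and ZPOPMIN atomically dequeues the best one. A sketch using the redis-rs crate (the key name and safe:// URLs are invented for the example):

```rust
fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    // Enqueue URLs with a priority score (lower score = crawled sooner).
    redis::cmd("ZADD").arg("crawl:frontier").arg(1.0).arg("safe://example/home")
        .query::<i64>(&mut con)?;
    redis::cmd("ZADD").arg("crawl:frontier").arg(5.0).arg("safe://example/deep-page")
        .query::<i64>(&mut con)?;

    // Atomically pop the lowest-scored (highest-priority) URL.
    let next: Vec<(String, f64)> =
        redis::cmd("ZPOPMIN").arg("crawl:frontier").arg(1).query(&mut con)?;
    if let Some((url, score)) = next.into_iter().next() {
        println!("crawl next: {url} (priority {score})");
    }
    Ok(())
}
```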
The indexing and crawling processes will (for now) run on the (gigabit-fiber-connected) home PC which @shane will be using to run a permanent vault when the network officially goes live.
Cost of updating the public search index will (hopefully) come from the AppDeveloper wallet attached to Safe-CMS, but I'm estimating fairly low costs for storing the compressed indexes, so if Safe-CMS doesn't make any Safecoin, it's not a big expense.
Estimated release date for this is early-to-mid April. This seems like quite a long way off, but much of the work is waiting on @shane to finish up on Safe-CMS so he can dedicate extra time to completing this.
Co-authors:
Shane Armstrong (@shane)
Software engineer with six years' experience building massively scalable web applications in Rust and PHP 5/7. Author of Safe-CMS.
Andy Alban (@AndyAlban)
Frontend software developer with experience building React applications and PHP7 server backends.