Description
I have begun development of a SafeNet crawler and search indexer, which will form the backbone of a SafeNet search engine application allowing users to discover public content.
Motivation
Allowing users to search for content will be a major benefit to the SafeNet, especially with the impending release of Safe-CMS: Safe-CMS enables content creation, and this project will enable content discovery.
Progress so far:
We have created a basic web crawler which follows the links it finds on web pages to discover other pages, domains, etc. At the moment it's crawling the regular HTTP internet (prototyping it this way was simpler while we work on CLI libraries for interfacing with the SafeNet). This is rudimentary, of course, but it's a good first step. Here is a little demo:
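For the curious, the heart of a crawler like this is a small fetch-parse-enqueue loop. The sketch below is a hypothetical simplification, assuming the `reqwest` (with its blocking feature) and `scraper` crates; it is not our production code:

```rust
use std::collections::{HashSet, VecDeque};

use scraper::{Html, Selector};

fn crawl(seed: &str, limit: usize) {
    let mut frontier: VecDeque<String> = VecDeque::from([seed.to_string()]);
    let mut visited: HashSet<String> = HashSet::new();
    let links = Selector::parse("a[href]").unwrap();

    while let Some(url) = frontier.pop_front() {
        // Stop once we hit the page limit; skip anything already crawled.
        if visited.len() >= limit || !visited.insert(url.clone()) {
            continue;
        }
        // Fetch the page, skipping it on any network error.
        let body = match reqwest::blocking::get(&url).and_then(|r| r.text()) {
            Ok(body) => body,
            Err(_) => continue,
        };
        // Extract outbound links and push unseen ones onto the frontier.
        for a in Html::parse_document(&body).select(&links) {
            if let Some(href) = a.value().attr("href") {
                if href.starts_with("http") && !visited.contains(href) {
                    frontier.push_back(href.to_string());
                }
            }
        }
        println!("crawled: {url}");
    }
}
```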
In addition, we've drawn up an application diagram detailing which software already exists and what we still need to build. In simple terms (though we've privately gone into much more depth), we need to build the following:
- A multi-threaded SafeNet crawler which generates and stores a structured graph of nodes representing the SafeNet pages, with directed edges representing links (familiar to those with experience of graph theory; a sketch of this structure follows the list)
- A process which filters, prioritizes, and generally manages the queue of uncrawled URLs, as well as maintaining internal lists of recently crawled, redundant, and malicious URLs
- A process which takes the structured graph and calculates a "page-rank" for each node based on its content, page-rank inherited from in-edges, etc., and stores the search indexes on the SafeNet as mutable data, to be consumed by:
- An app which downloads the compressed indexes (as needed) and allows searching SafeNet websites locally (also sketched below)
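To make the graph and page-rank items concrete: below is a hypothetical in-memory shape for the page graph, plus one iteration of the classic PageRank recurrence with damping factor d. Our real scorer will also weight page content, and the real graph lives in external storage rather than memory, so treat this purely as a sketch of the idea:

```rust
use std::collections::HashMap;

/// Directed page graph: each node id maps to the node ids it links to.
type Graph = HashMap<u64, Vec<u64>>;

/// One iteration of the classic PageRank recurrence with damping factor `d`:
/// rank(n) = (1 - d) / N + d * sum over in-links m of rank(m) / out_degree(m).
/// `ranks` is assumed to hold a value for every key in `graph`.
fn pagerank_step(graph: &Graph, ranks: &HashMap<u64, f64>, d: f64) -> HashMap<u64, f64> {
    let n = graph.len() as f64;
    // Every node starts with the baseline (1 - d) / N ...
    let mut next: HashMap<u64, f64> = graph.keys().map(|&k| (k, (1.0 - d) / n)).collect();
    // ... and each in-edge adds a share of the source node's current rank.
    for (&node, out_edges) in graph {
        let share = d * ranks[&node] / out_edges.len().max(1) as f64;
        for &target in out_edges {
            *next.entry(target).or_insert((1.0 - d) / n) += share;
        }
    }
    next
}
```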
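And for the search app itself, the rough idea is that a downloaded index shard maps terms to ranked posting lists, and a query intersects those lists. Again, this is a hypothetical sketch; the actual index format is still being designed:

```rust
use std::collections::HashMap;

/// Hypothetical shape of a downloaded index shard:
/// term -> list of (url, precomputed rank) pairs.
type IndexShard = HashMap<String, Vec<(String, f64)>>;

/// Keep only pages matching every query term, ordered best-ranked first.
fn search(index: &IndexShard, query: &str) -> Vec<String> {
    let terms: Vec<&str> = query.split_whitespace().collect();
    let mut hits: HashMap<String, (f64, usize)> = HashMap::new();
    for term in &terms {
        for (url, rank) in index.get(*term).into_iter().flatten() {
            let entry = hits.entry(url.clone()).or_insert((0.0, 0));
            entry.0 += rank; // accumulate rank across terms
            entry.1 += 1; // count how many query terms matched this page
        }
    }
    let mut results: Vec<(String, f64)> = hits
        .into_iter()
        .filter(|(_, (_, count))| *count == terms.len())
        .map(|(url, (rank, _))| (url, rank))
        .collect();
    results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    results.into_iter().map(|(url, _)| url).collect()
}
```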
Stuff which already exists and will be utilized:
- The queues will be simple priority-weighted key/value Redis instances (see the sketch after this list)
- The storage for the graph structure will initially be MySQL, but once complexity increases (and the size of the index ramps up as the SafeNet becomes more popular) we'll likely look at a more suitable tool; we're trying to minimize complexity for the initial release.
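To illustrate the Redis queues mentioned above: a sorted set makes a natural priority-weighted frontier, since ZADD attaches a score to each URL and ZPOPMIN atomically dequeues the best one. A sketch using the redis-rs crate (the key name and safe:// URLs are invented for the example):

```rust
fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    // Enqueue URLs with a priority score (lower score = crawled sooner).
    redis::cmd("ZADD").arg("crawl:frontier").arg(1.0).arg("safe://example/home")
        .query::<i64>(&mut con)?;
    redis::cmd("ZADD").arg("crawl:frontier").arg(5.0).arg("safe://example/deep-page")
        .query::<i64>(&mut con)?;

    // Atomically pop the lowest-scored (highest-priority) URL.
    let next: Vec<(String, f64)> =
        redis::cmd("ZPOPMIN").arg("crawl:frontier").arg(1).query(&mut con)?;
    if let Some((url, score)) = next.into_iter().next() {
        println!("crawl next: {url} (priority {score})");
    }
    Ok(())
}
```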
The indexing and crawling processes will (for now) run on the (gigabit-fiber-connected) home PC which @shane will be using to run a permanent vault when the network officially goes live.
Cost of updating the public search index will (hopefully) come from the AppDeveloper wallet attached to Safe-CMS, but I'm estimating fairly low costs for storing the compressed indexes, so if Safe-CMS doesn't make any Safecoin, it's not a big expense.
Estimated release date for this is early-to-mid April. This seems like quite a long way off, but much of the work is waiting on @shane to finish up on Safe-CMS so he can dedicate extra time to completing this.
Co-authors:
Shane Armstrong (@shane)
Software engineer with six years' experience building massively scalable web applications in Rust and PHP 5/7. Author of Safe-CMS.
Andy Alban (@AndyAlban)
Frontend software developer with experience building React applications and PHP7 server backends.